DESIGN AND ANALYSIS


OF EXPERIMENTS IN


PSYCHOLOGY AND EDUCATION


E. F. Lindquist · STATE UNIVERSITY OF IOWA


HOUGHTON MIFFLIN COMPANY: BOSTON 


The Riverside Press Cambridge


1956 IMPRESSION 


COPYRIGHT, 1953, BY E. F. LINDQUIST 
ALL RIGHTS RESERVED, INCLUDING THE RIGHT 
TO REPRODUCE THIS BOOK OR PARTS THEREOF 
IN ANY FORM

The Riverside Press
CAMBRIDGE, MASSACHUSETTS 
PRINTED IN THE U.S.A. 


Preface 


This book is designed primarily as a teaching instrument and learning
aid. Its purpose is to help students and research workers in psychology and 
education learn how to select or devise appropriate designs for the experi- 
ments they may have occasion to perform, and to analyze and interpret 
properly the results obtained through the use of those designs. 


Either of two contrasting procedures could be followed in the preparation 
of a text with this purpose. One is to present a number of “standard” designs, 
selected for maximum usefulness in these fields, and to provide for each a 
worked example of its application to a specific problem, together with step- 
by-step directions for the computational work involved, with general rules 
for the interpretation of results, and with suggestions concerning the types 
of problems with which the designs may be employed. Under this procedure, 
the student would be asked to take the underlying theory for granted, and 
only a minimum demand would be made on his mathematical skills and abilities.
This might be described as a “cook book” type of text, consisting of
“recipes” for the use of a few basic designs, or as a “follow the leader”
type of text, implying that the user follows blindly the examples set for him, 
without understanding the reasons for what he does. 


The other possibility is to attempt to develop in the student a genuine 
understanding of the basic principles of experimental design and analysis, so 
that he may be capable of devising for himself the particular design needed 
in any specific situation, or of modifying or combining the commonly used 
or “standard” designs and analytical procedures and of qualifying the 
inferences drawn as specific conditions may demand. 


The limitations and dangers of the first of these procedures in the training 
of research workers are fairly evident. The specific designs needed in psy- 
chological research in general are so many and so varied that it is quite 
impracticable in a single text to provide fully developed specific illustrations 
or “model solutions” for more than a small proportion of the total number
needed. The student trained under the first of these procedures will tend
to use only those designs for which “model” examples have been provided, 
to select the one of these designs for which the illustrative problem “seems 
most like” the particular problem with which he is concerned, and to analyze
and interpret the results exactly as in the illustrative problem. Lacking any 
genuine understanding of the principles involved, or any adequate appreci- 
ation of underlying assumptions, he will fail to recognize important differ- 
ences between his own situation and the “model” situation, and will there- 
fore frequently select inappropriate designs or draw wrong inferences from 
the results obtained. A large proportion of the many mistakes that have 
been made in educational and psychological research in the past have thus 
been due to attempts to apply intact to a new field or problem a design or 
technique which has been successfully used in or with another, without 
recognizing the many subtle but fundamental differences in the situations 
involved. 


For these reasons, the latter of these two procedures will be that followed 
in this text. With very few exceptions, the rule will be observed of present- 
ing all of the essential mathematical and logical basis of the designs and 
analytical procedures considered, but of presenting this basis in a form that 
will place it well within the grasp of the typical student and research worker 
in these fields. Specifically, it will be assumed that the typical student 
using this text will have had only a single introductory course in applied 
statistics, including the binomial, normal, and t distributions as applied
in sampling error theory to simple random samples and simple (product
moment) correlation theory, and that his formal training in mathematics
may not have gone beyond a year's course in high school algebra, or does 
not include calculus. Because of these limitations, it will of course be neces- 
sary to make some exceptions to the rule just stated. It will be necessary 
to ask the student to take for granted, without proof, the derivations of the 
basic sampling distributions (such as the χ² and F-distributions) and the
probability tables based on these distributions. Occasionally, to save time 
and space, proofs will also be omitted for some of the less important and inci- 
dental techniques employed, even though relatively simple proofs for these 
techniques could be presented. 


Because of the student's background, it will be necessary in most cases 
to employ proofs that are considerably more cumbersome and indirect, and in 
some instances less rigorous, than those that the mathematical statistician 
would prefer. For the purpose of insuring that the student will later make 
most intelligent use of these designs, however, the important considerations 
are that the proofs employed be essentially sound, that they stress all of 
the important assumptions necessary in the most rigorous proofs, that they 
be meaningful to the student, and that they be convincing or “intuitively” 
acceptable to him. Again, in consideration of the student's background, the
explanations and discussions accompanying the mathematical derivations
may at times seem somewhat verbose to the trained reader who may employ 
this text for reference purposes. The author has no apologies to offer for this 
feature. On the contrary, experience has shown that the student very fre- 
quently needs and appreciates considerable repetition or restatement of im- 
portant concepts and explanations. 


A special effort will be made in this text to develop in the student a keen 
awareness and thorough understanding of the assumptions underlying the 
various tests of significance and of the consequences of failure to satisfy 
these assumptions, with specific reference to the situations in and materials 
with which the educational and psychological research worker must contend. 
A special effort will also be made to clarify to the student the considerations 
involved in the selection of valid “error” terms and in the use of “pooled” 
error terms. These are matters which have often been seriously neglected in 
discussions of experimental design, and this neglect has in large part accounted 
for the frequent use of inappropriate designs and the drawing of wrong 
inferences from experimental results. 


A special feature of this text is the reliance placed upon the study exercises 
at the end of each chapter. These are intended to develop in the student 
a genuinely functional understanding of the designs and techniques pre- 
sented, and to induce him to take an aggressive rather than a passive learning 
attitude. It is common practice in the writing of textbooks of this kind to 
introduce concrete illustrations into the discussions as early as possible, 
and to present most of the logic in specific terms of these illustrations. This 
often has the advantage of making the ideas initially easier to grasp, but 
also often has the disadvantage of associating the ideas presented too closely 
with the particular illustration employed, so that the student has difficulty 
later in generalizing the concepts gained, in divorcing them from the
unique textbook illustrations, and in “translating” them into the terms
of new applications. In this text, the policy will be followed of presenting 
each design and procedure initially in rather highly generalized terms, and 
of then leaving it to the student himself to make these generalizations more 
specific and meaningful in terms of the concrete illustrations suggested in 
the study exercises. Care will be taken in these exercises to provide a suffi- 
cient number of leading questions that the student should meet with no 
inordinate difficulties in making the applications. The instructor is of course 
free to select those exercises that are most closely related to the peculiar 
interests of his students, or to devise similar exercises that will be more 
satisfactory in this respect. 


Because of the policy of employing highly generalized and logical initial 
presentations, instead of relying upon specific illustrations introduced early 
or in a cook book fashion, this text will prove relatively difficult to read,
particularly upon first reading or before consideration of the study exercises.
It has been the author's experience, however, that the total difficulty for 
the student of securing a genuine understanding of experimental design and 
analysis is reduced rather than increased by this type of presentation. It 
should be emphasized that the study exercises are intended to constitute an 
integral part of the total presentation, and that the textual part of this 
text cannot fairly be evaluated alone apart from the manner in which it is 
to be used with the study exercises in instruction and learning. 


An effort has been made to base the study exercises, wherever possible, 
upon actual applications reported in the research literature. In many in- 
stances, however, it was necessary to modify the reported descriptions or 
data so as to adapt them more specifically to the purposes of the study exercises.
The study exercises, therefore, are not to be regarded as containing any
implications concerning the validity or adequacy of the research studies cited.


The content of this text follows a close, logical organization. Each new 
step taken in the logical development rests upon those taken earlier. It is 
especially important, therefore, that the student master thoroughly the 
logic of the simpler designs considered in the early chapters, particularly 
Chapters 3-9. Some of the simpler designs find relatively few practical 
applications, but all are essential for developmental purposes. The student 
is warned that until he feels that he has acquired a complete understanding 
of these simpler designs, he can hardly expect to comprehend at all the more 
complex and perhaps more practical designs which follow. He is
assured, however, that once he has a genuine understanding of the basic 
principles presented in these earlier chapters, he will find that he can move 
ahead with surprising ease through the later chapters in the text. 


This text has been developed through the use and revision of three suc- 
cessive multilithed editions with the author's classes. The help the author 
received from his students in the revision and improvement of these ma- 
terials was of inestimable value. Special acknowledgement for such help is 
due to Dr. Dorothy Sherman, for her most exceptional thoroughness in detect- 
ing typographical errors, slips, and inadequacies in presentation, and to Rita 
Senf for her help in eliminating many infelicities in style and diction. 


The entire manuscript (excluding the study exercises) was read for the 
publishers by W. G. Cochran, of Johns Hopkins University, and by David 
Grant, of the University of Wisconsin, and both made their comments avail- 
able to the writer in time for a final revision of the manuscript before publi- 
cation. Both readers read the manuscript with unusual care, and the author 
was enabled to make many important improvements in presentation as well 
as to correct several errors in theory on the basis of these comments. The 
author gratefully acknowledges the very valuable help thus secured from the 
publishers and their readers. 


Major acknowledgement is due to two of the author's students and research 
assistants, Dee W. Norton and Leonard Feldt. Dee Norton made a very time- 
consuming and painstaking search of the research literature in psychology 
and education for illustrative applications of the various designs, and was 
primarily responsible for the construction of the study exercises based on these 
illustrations. Dr. Norton's study of the effects upon the F-distribution of 
non-normality and heterogeneity of variance (pages 79-90) constitutes a real 
contribution to better understanding of the techniques considered in this text. 
Leonard Feldt took full responsibility for reading galley and page proof, and 
for compilation of the index. 


The author is indebted to Professor Sir Ronald A. Fisher, Cambridge, to 
Dr. Frank Yates, Rothamsted, and to Messrs. Oliver and Boyd Limited, 
Edinburgh, for permission to reprint Tables 1 and 2, the Appendix Table, 
and part of Table 3, from their books Statistical Tables for Biological, Agricul- 
tural, and Medical Research and Statistical Methods for Research Workers. 


E. F. LINDQUIST 


Contents 


1 Introduction: Fundamental Concepts and Basic Designs

The Nature and Purpose of Educational and Psychological Experiments in General
The Importance of Measures of Precision
Testing Hypotheses
The Essential Characteristics of a Good Experimental Design
Basic Experimental Designs
Basic Types of Error
The Principle of Randomization
Illustrative Applications of Basic Designs

2 The Chi-Square, t, and F Distributions

Introduction
The Chi-Square Distribution
Proof of the Independence of the Mean and Variance of Random Samples Drawn from a Normal Population
Degrees of Freedom
The t-Distribution
The F-Distribution

3 The Simple-Randomized Design

The Importance of the Simple-Randomized Design
The Hypothesis to Be Tested
Limitations of the Simple t-Test
The Test of the Over-all Null Hypothesis
The Measure of Discrepancy as a “Mean Square” Ratio
Type I and Type II Errors
The Importance of the Assumptions Underlying the F-Test
The Norton Study of the Effects of Non-normality and Heterogeneity of Variance
Testing the Significance of the Difference in Means for Individual Pairs of Treatments
The Significance of the Difference Between Two Sample Means When the Population Variances Differ
Types of Applications of the Simple-Randomized Design to Experimental Data
Applications of the Simple-Randomized Design to Observational Data

4 Analysis of Variance in Double-Entry Tables

Introduction
Analysis of Total Sum of Squares into Four Components (Method of Arithmetic Corrections)
Analysis of Total Sum of Squares into Four Components (Algebraic Method)
The Case of One Observation per Cell
Computational Procedure
A Worked Example
The Generalized Meaning of Interaction

5 Treatments × Levels Designs

Generalized Case of the Treatments × Levels Designs
The Analysis of the Total Sum of Squares
The Meaning of Interaction
Constituting the “Levels” in an Experiment
Selection of the Control Variable
Testing the Significance of the Main Effect of Treatments
Test of the Significance of the Interaction
The Meaning of ms_A/ms_AL
Treatments × Levels Designs with One Observation per Cell
Tests of Significance Applied to Individual Differences
Possibilities of Confounding Extraneous Factors with Levels
Limitations and Advantages of the Treatments × Levels Design
What to Do About Missing Cases
The Use of Transformations

6 The Treatments × Subjects Design

The Generalized Case of the Treatments × Subjects Design
Analysis of the Total Sum of Squares
Testing the Significance of the Treatments Effects
Limitations and Advantages of the Treatments × Subjects Design
Randomizing or Counter-Balancing Sequence and Order Effects
Confounding Extraneous Factors with “Subjects”
Testing Differences in Individual Pairs of Treatment Means
Establishing a Confidence Interval for the True Mean for Any Treatment

7 The Groups-Within-Treatments Design

The Generalized Case of the Groups-Within-Treatments Design
The Analysis of Variance in Groups-Within-Treatments Designs (The Subject as the Unit of Analysis)
Computational Procedures (Subject the Unit of Analysis)
Analysis of Unweighted Group Means (The Group as the Unit of Analysis)
Test of the Hypothesis of Equal Treatment Means (Unweighted)
The Groups Considered as Random Samples from Corresponding (Hypothetical) Subpopulations
The Expected Values of msa and ms¢a
Meaning of F = msewa/MS8wg
Precision of Individual Means and of Differences in Pairs of Means
General Advantages and Limitations of the Groups-Within-Treatments Design

8 The Random Replications Design

The Generalized Case of the Random Replications Design
The Random Replications (A × R) Design When the Population Consists of Finite Groups
Replications of the Simple-Randomized Design with Subgroups of the Same Size (Random Sampling from Randomly Selected Subpopulations)
The Special Case of “Simple” Replications
Testing the AR Interaction in Random Replications of Treatments × Levels or Treatments × Subjects Designs
Testing Differences in Individual Pairs of Treatment Means
Establishing a Confidence Interval for the True Mean for a Given Treatment
Important Precautions in the Planning and Administration of a Random Replications (A × R) Experiment
Advantages and Limitations of the Random Replications Design
The Possibilities of Simple Random Replication
The Use of ms_AL as an Error Term in Treatments × Levels Designs

9 Factorial Designs (Two Factors)

The Generalized Case of the Two-Factor (A × B) Design
The Test of the AB Interaction and Its Interpretation
Testing the Significance of the Main Effect of Either Factor
Testing the Simple Effects of Either Factor
Individual Comparisons of Row, Column, or Cell Means
The Meaning of ms_(A or B)/ms_AB When the Interaction Is Significant
The Conditions Under Which ms_(A or B)/ms_AB Is Distributed as F
How to Make Comprehensive Tests of Significance When the Interaction Is in Part Intrinsic and in Part Due to Randomized Type G Errors
The Use of Transformations

10 Three-Dimensional Designs

Introduction
Analysis of Total Sum of Squares
Computational Procedures
Meaning of Triple Interaction
Applications of Three-Dimensional Designs
Random Replications of a Two-Factor Experiment (A × B × R Designs)
Treatments × Treatments × Subjects (A × B × S) Designs
Random Replications of Treatments × Levels Designs (A × L × R Designs)
Two-Factor Experiments with Matched Groups (A × B × L Designs)
Three-Factor (A × B × C) Designs
Testing Differences in Individual Pairs of Means in Three-Dimensional Designs

11 Higher-Dimensional Designs

Analysis in Higher-Dimensional Designs
Computational Procedures
Interpretation of Higher-Order Interactions
A Notation for Factorial Designs
Practical Limitations of Higher-Dimensional Designs
“Complete” and “Incomplete” Factorial Designs


12 Latin Square and Graeco-Latin Square Designs

Introduction
Analysis in Simple Latin Square Designs
Confounding in Latin Square Designs
Graeco-Latin Squares

13 Controlling Individual Differences in Factorial Experiments Through the Use of “Mixed” Designs

Introduction
Type I Designs
Type II Designs
Type III Designs
Type IV Designs
Type V Designs
Type VI Designs
Type VII Designs
Summary of Two- and Three-Factor Designs
Additional Designs
Partial Confounding
Mixed Higher-Order Experiments

14 Analysis of Covariance

Nature and Purposes of Analysis of Covariance
Basic Formulas
An Illustrative Example
Importance of the Assumptions Underlying the Test of Significance of the Treatments Effect
Test of Homogeneity of Regression
Generalized Procedure
Analysis of Covariance vs. the Treatments × Levels Design as a Means of Increasing the Precision of an Experiment
Analysis of Covariance as a Means of Introducing an Additional Factor into a Factorial Experiment
Statistical Control of More than One Concomitant Variable

15 Tests Concerned with Trends

Introduction
Tests of Trend in the Simple-Randomized Design
Tests for Trend in Treatments × Levels Designs
Tests for Trend in Treatments × Subjects Designs
Tests of Trend in Type II (Confounded) Designs
Comparisons of Trends
Designs Appropriate for Trend Comparisons

16 Estimation of Variance Components in Reliability Studies

Introduction
The One-Dimensional Design
The Two-Dimensional Design
The Three-Dimensional Design
Groups (of Observations) Within Subjects


Tables 


Table of χ²
Table of t
Percent Points in the Distribution of F
Phase 1 of the Norton Study. Percents of Mean Square Ratios in Empirical Distributions Exceeding Given Percent Points in the Normal Theory F-Distribution
Phases 2, 3, and 4 of the Norton Study. Percents of Mean Square Ratios in Empirical Distributions Exceeding Given Percent Points in the Normal Theory F-Distribution
Distribution of Intelligence Test Scores for an Experimental Sample and for the Population Represented by the Sample
Analysis of Total Sum of Squares in Three-Way Table into Eight Components by Successive Applications of the Method of Arithmetic Corrections
Computational Formulas for a Three-Classification (R × C × S) Design
Complete Sets of Orthogonal Latin Squares
Analysis of Type I Designs
Analysis of Type II Designs
Analysis of Type III Designs
Analysis of Type IV Designs
Analysis in Type V Designs
Analysis in Type VI Designs
Analysis in Type VII Designs
Summary of Two- and Three-Factor Mixed Designs
Appendix: Table of Random Numbers


1 Introduction: Fundamental Concepts and Basic Designs


The Nature and Purpose of Educational and Psychological Experiments in General


The major purpose of psychological experiments is to describe the effect of 
certain experimental “treatments” upon some characteristic of a particular 
population, or to test some hypothesis about this effect. The term “treatment”
will be used throughout this text to refer in general to any induced or
selected variation in the experimental procedures or conditions whose effect 
is to be observed and evaluated. In a given experiment the “treatments” may
be methods of teaching a particular school subject; they may be interpolated 
experiences in a learning series; specific amounts of practice in performing a 
certain task; induced states of anxiety; the taking of certain types of achieve- 
ment tests; specified dosages of certain drugs; etc. If the treatments represent 
different amounts or degrees of a single variable, that variable will be referred 
to as the “experimental variable.” For example, the treatments may
represent different amounts of practice in the same task, or differing intensities
of a certain auditory stimulus, etc. In many experiments, however, there is
no single “experimental variable,” or no single respect in which the various
treatments differ from one another. That is, the treatments may represent 
complex combinations of variations in a large number of factors which may 
not be specifically identified — as when the treatment consists of a complex 
method of teaching a given school subject, and there are many respects, 
instead of only one, in which the various treatments differ from one another. 

In most experiments, the observed “effect” is described in terms of changes 
or differences in the mean value of a certain “criterion” variable. For example,
the effects of different experimental conditions under which a certain task is 
performed may be measured in terms of the mean time of completing the task. 
In other experiments, the effects may be measured in terms of variances,
ranks, correlation coefficients, regression coefficients, proportions, etc. A
single complex experiment may be concerned with more than one experimental 
variable (or more than one treatment classification), and also with the possible 
effects of the treatments on more than one criterion variable. However, the 
general purpose of the experiment may usually be analyzed into more specific 
purposes, each concerned with the effect of a single factor on a single criterion 
variable. 


The Importance of Measures of Precision 


In general, the experimental results differ from subject to subject, and are
influenced by accidental or unintentional variations in many extraneous
factors. Accordingly, the “effect” observed in a single experiment must
always be regarded as an estimate of the corresponding “true” effect, that is,
the effect that would have been obtained in a perfectly controlled experiment 
involving all members of the specified population. The usefulness or value of 
the experiment therefore depends upon two major characteristics of the 
"estimate" obtained: (1) its freedom from bias, and (2) its precision. An 
estimate may be said to be free from bias to the degree that its average value 
for an increasing number of similar experiments tends to approach the “true” 
value. The precision of the estimate depends upon the variability of such 
estimates for such a series of experiments — the less variable the estimates, the 
more precise is any single estimate. 
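
The distinction between bias and precision may be made concrete by simulating a long series of similar experiments. The following Python sketch is an illustration added here, not a procedure from this text. It assumes a hypothetical population in which the true effect is 5 score points and the criterion scores are normally distributed with a standard deviation of 10. The names `run_experiment` and `summarize`, and the `selection_advantage` parameter (a constant head start given to the treated group in order to produce a biased estimate), are likewise assumptions made only for illustration.

```python
import random
import statistics

TRUE_EFFECT = 5.0   # assumed "true" effect (hypothetical value)
SD = 10.0           # assumed standard deviation of criterion scores


def run_experiment(rng, n, selection_advantage=0.0):
    """Return the observed effect (difference in group means) for one experiment.

    selection_advantage models a systematic, non-random head start for the
    treated group, which biases the estimate.
    """
    control = [rng.gauss(50.0, SD) for _ in range(n)]
    treated = [
        rng.gauss(50.0 + TRUE_EFFECT + selection_advantage, SD) for _ in range(n)
    ]
    return statistics.mean(treated) - statistics.mean(control)


def summarize(rng, n, replications, selection_advantage=0.0):
    """Mean and standard deviation of the estimate over many similar experiments."""
    estimates = [
        run_experiment(rng, n, selection_advantage) for _ in range(replications)
    ]
    return statistics.mean(estimates), statistics.stdev(estimates)


rng = random.Random(1953)
small = summarize(rng, n=10, replications=4000)
large = summarize(rng, n=40, replications=4000)
biased = summarize(rng, n=40, replications=4000, selection_advantage=2.0)

print("n=10, unbiased: mean estimate %.2f, sd %.2f" % small)
print("n=40, unbiased: mean estimate %.2f, sd %.2f" % large)
print("n=40, biased:   mean estimate %.2f, sd %.2f" % biased)
```

Under these assumptions, the average of the unbiased estimates should lie close to the assumed true effect for either sample size, the estimates based on the larger groups should vary less from experiment to experiment, and the biased estimates should center on a wrong value however precise they may be.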

In any experiment, it is just as important to know how precise the estimate 
is as to have the estimate itself. If no description whatever of the precision 
of the estimate is available, anyone may successfully contend that the observed 
effect is due entirely to “error” — to fluctuations in random sampling or to 
unintentional variations in extraneous factors. The estimate is therefore 
worthless, no matter how precise it may be in fact. This does not mean that 
the precision of the estimate must always be objectively described in quanti- 
tative terms, such as in terms of the standard error of the estimate. But unless 
some description of its precision is available, even though subjective, unre- 
liable, and non-quantitative, it is impossible to know what inferences may 
safely be drawn from the estimate, or within what limits one may rely upon 
those inferences. It is extremely important, therefore, that the experiment 
be planned so as to provide a dependable description of precision with each 
estimate obtained. 

Unfortunately, in the planning of many experiments, consideration is 
originally given only to the problem of how to measure or describe the desired 
effect, that is, to the problem of estimation. The problem of how to describe
the precision of the estimate or how to test its significance frequently receives 
no consideration whatever until the experiment is concluded and all results 
are in. At this point, unfortunately, it is frequently found that the experi- 
mental design does not permit the valid use of any known test of significance, 
whereas a change could easily have been made in the original design to make
this possible. It should therefore be a maxim of experimental research that
in the original planning of the experiment provision should be made both for an 
unbiased estimate of the desired effect and for a valid quantitative description 
or estimate of the precision of the estimated effect. The latter description will
be referred to as the “error estimate.” Not a single step should be taken in the
administration of the experiment until the problem of how to analyze and 
evaluate the results has been thought through and solved in complete detail. 

In attempting to improve a contemplated experimental design, then, any 
provision for increased precision in the estimate of the treatment effect must 
be accompanied by a corresponding revision in the estimate of error. In fact, 
no efforts to increase the precision of an experiment will be of much avail 
unless one can also dependably describe the increased degree of precision 
attained. On the contrary, if by some device one eliminates a certain source
of error, yet continues to employ an error estimate allowing for errors from 
that source, the experiment may appear less conclusive than before, even 
though it is actually more precise. Suppose, for example, that in order to 
control chance differences among the treatment groups in a learning experi- 
ment, these groups are “matched” with reference to intelligence test scores, 
but that nevertheless the test of significance appropriate for random (un- 
matched) samples is still used. In this case, the observed differences in the 
learning criterion among the various treatment groups may actually become 
smaller (since the differences may no longer be inflated by this particular 
source of error); but this fact will only make the available estimates of error 
appear larger in relation to the reduced differences. It should therefore also 
be a maxim of experimental design that if a given source of error cannot be 
eliminated both from the experimental results and from the estimates of 
error, it had usually better not be eliminated at all. In other words, it may 
sometimes be desirable to select an experimental design resulting in lower 
precision than another, if the first design permits a valid estimate of error 
and the second does not. 
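
The consequence of retaining the wrong error estimate after matching can be sketched numerically. The example below is an added illustration in Python using hypothetical data, not an analysis taken from this text. It assumes two treatments, 30 matched pairs, and criterion scores made up of a component shared within each pair (standing in for whatever the intelligence test measures) plus independent random error. The functions `unmatched_t` and `matched_t` are illustrative names for the familiar t ratios for independent samples and for matched pairs.

```python
import math
import random
import statistics


def unmatched_t(x, y):
    """t ratio for two independent random samples of equal size (pooled variance)."""
    n = len(x)
    pooled_var = (statistics.variance(x) + statistics.variance(y)) / 2.0
    se = math.sqrt(pooled_var * 2.0 / n)
    return (statistics.mean(y) - statistics.mean(x)) / se


def matched_t(x, y):
    """t ratio for matched pairs: the error estimate uses within-pair differences."""
    diffs = [b - a for a, b in zip(x, y)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se


rng = random.Random(7)
n_pairs = 30
# Hypothetical data: each matched pair shares an "intelligence" component,
# which is the source of error removed by matching.
shared = [rng.gauss(0.0, 10.0) for _ in range(n_pairs)]
control = [50.0 + s + rng.gauss(0.0, 3.0) for s in shared]
treated = [53.0 + s + rng.gauss(0.0, 3.0) for s in shared]

t_wrong = unmatched_t(control, treated)  # error estimate still includes the matched source
t_right = matched_t(control, treated)    # error estimate with the matched source removed
print("t with unmatched-sample error estimate: %.2f" % t_wrong)
print("t with matched-pair error estimate:     %.2f" % t_right)
```

Both ratios have the same numerator, the observed difference in means; they differ only in the error estimate. With data of this kind the ratio computed as for unmatched samples is markedly smaller in absolute value, so that the matched experiment appears less conclusive than it really is.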

In general, the variations in the criterion measures among all of the subjects 
involved in an experiment (the “total” variance) may be attributed to a 
variety of different factors, or may arise from a number of different sources. 
If the experiment has been properly designed, it will be possible to analyze the 
total variance into a number of independent components, each of which may 
be identified with one of these sources of variation. One part of the total 
variance will be due to the experimental treatments. Other parts will be due 
to extraneous factors which have been controlled in the experiment, so that 
their effects on the various treatment groups are the same. Still other parts of 
the total variance will be due to uncontrolled sources, that is, to “error.” The
test of the significance of the treatment effects consists essentially in a com- 
parison of the “treatments” variance with the “error” variance, in order to 
determine whether or not the observed differences among the treatment 
means may be attributed to these uncontrolled sources. Suppose, for example, 
that in a learning experiment in which the various treatment groups have been
matched with reference to intelligence test scores, a single distribution of the
criterion measures is made for all subjects in all treatment groups. The vari- 
ance of this distribution (the total variance) is presumably due in part to 
differences among the treatments. It is also due in part to differences from
subject to subject in whatever is measured by an intelligence test, a source 
of variation which in this case is controlled by the matching of subjects. 
Finally, it is due in part to random (uncontrolled) variations in the many 
other factors affecting learning, such as nature and amount of previous 
training, age, sex, motivation, etc. The test of significance of the treatment
effect consists essentially in a comparison of the treatments variance with the
error variance.

Sometimes, as has been suggested earlier, the mistake is made of failing to 
isolate that part of the total variance which is due to controlled sources, and an 
"error" variance is employed which is really a composite due both to con- 
trolled and uncontrolled sources. In matched group experiments like that 
just described, for example, the mistake has been made of failing to “take out” 
of the error variance that part of the total variance due to the controlled 
(matched) variable, and of still using the error estimate appropriate for 
independent (unmatched) random samples. The result, of course, is to 
make the so-called “error” variance unduly large and to make the experiment 
seem less precise than it really is. 
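
The inflation of the error estimate described above can be illustrated numerically. The following Python sketch is not part of the original text. It assumes hypothetical level means (60, 50, 40), a hypothetical within-cell standard deviation of 3, and ten subjects per cell, and it compares the error variance computed as if the samples were unmatched with the error variance obtained after the matched variable has been taken out.

```python
import random
from statistics import mean


def sum_sq_dev(scores):
    """Sum of squared deviations of the scores from their own mean."""
    m = mean(scores)
    return sum((x - m) ** 2 for x in scores)


def simulate(seed=0, n=10):
    """Hypothetical criterion scores for two treatments at three matched levels."""
    rng = random.Random(seed)
    level_means = {"superior": 60.0, "average": 50.0, "inferior": 40.0}  # assumed values
    data = {}
    for treatment in ("A1", "A2"):
        for level, mu in level_means.items():
            data[(treatment, level)] = [rng.gauss(mu, 3.0) for _ in range(n)]
    return data


def error_variances(data):
    """Return (error variance ignoring the matching, error variance with levels removed)."""
    treatments = sorted({t for t, _ in data})
    ss_unmatched, df_unmatched = 0.0, 0
    for t in treatments:
        # Pool every score under one treatment, as if the samples were unmatched.
        scores = [x for (tt, _), cell in data.items() if tt == t for x in cell]
        ss_unmatched += sum_sq_dev(scores)
        df_unmatched += len(scores) - 1
    # Deviations taken within each treatment-by-level cell exclude the level differences.
    ss_matched = sum(sum_sq_dev(cell) for cell in data.values())
    df_matched = sum(len(cell) - 1 for cell in data.values())
    return ss_unmatched / df_unmatched, ss_matched / df_matched


unmatched, matched = error_variances(simulate())
print(f"error variance ignoring the matching: {unmatched:.2f}")
print(f"error variance with levels removed:   {matched:.2f}")
```

Under these assumed values the first figure is much the larger, because it still contains the variation from level to level that the matching has already controlled.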

We can now understand why the technique of analysis employed with many 
experimental designs is known as the method of analysis of variance. In these
terms, the aim of the experimenter is to employ a design that will control or 
equalize the effect of as many important extraneous variables as feasible, that 
will randomize the effects of all uncontrolled factors, and that will permit 
analysis of the total variance into independent components which may be 
identified respectively with the treatments, the controlled sources of variation, 
and the "error" variations. The “error” variance will then constitute an 
accurate and unbiased estimate of the precision of the experiment, in terms 
of which the treatments effect may be properly evaluated. 

Complete freedom from bias and perfect precision in an experiment are, of 
course, both impossible and unnecessary. How unbiased or how precise an 
estimate need be depends upon the broader purposes of the experiment. 
Some experiments are intended to determine only whether an effect exists 
at all, or whether there is any relationship between the experimental and 
criterion variables. In that case, if the true effect is considerable, or if the 
true relationship is pronounced, even a very crude experiment may reveal 
the presence of the effect or relationship. In other experiments the presence 
of some relationship between the experimental and criterion variables may be 
taken for granted; the purpose of the experiment may be to describe the 
magnitude or the nature of that relationship, and the effects to be described 
may be known or expected to be of relatively small magnitude. In such an
experiment, obviously, a relatively high degree of precision is essential. 

In designing the experiment, therefore, the aim of the experimenter should
usually be, not to provide for the highest possible degree of precision in the 
estimate, but rather to secure, with the minimum expenditure of his resources,
whatever degree of precision and freedom from bias is sufficient for his purposes. 
His objective, in other words, is to design an experiment that will serve the 
specified purposes with maximum efficiency. No attempt will be made here to 
provide an exact definition of the “efficiency” of an experiment. Its efficiency 
may be roughly described as its “precision per unit of cost," but what is 
involved in “cost,” or in what units it may be measured, is often very diffi- 
cult to say. The true cost of an experiment may seldom be satisfactorily 
described simply in terms of dollars expended. Whenever human subjects are 
employed, the time, convenience, comfort, and motivation of the subjects are 
often more important than the time and convenience of the experimenter, 
but as elements of cost these factors are very difficult to assess or describe 
quantitatively. It should be clear, however, that the more precise of two 
experimental designs is not always the one to be preferred. The “cost” at 
which this precision is obtained is always an extremely important consideration 
in the choice. Unfortunately, in practice, the purposes of the experiment are 
often too vaguely defined or too complex to permit any exact statement of the 
degree of precision required, and the cost may be very difficult to assess. 
Hence, the practical aim of the designer frequently becomes only that, in 
effect, of securing the highest precision in the estimates that his own resources 
will permit. In other words, the factor of efficiency is often neglected. 


Testing Hypotheses 


The ultimate objective of psychological and educational research in general 
is to develop a more complete theory — of learning, of mental organization, of 
school organization, etc. It is therefore useful, in most experiments, to view 
the major purpose as that, not only of describing the effects of the treatments, 
but of testing some specific hypothesis concerning the true effects. In accord 
with the law of parsimony, we usually begin by testing the simplest possible 
hypothesis that will explain the observed effects. We will entertain more 
complex hypotheses only if we are forced to reject the simpler ones. Accord- 
ingly, we most often begin with the hypothesis that the true effect is nil, or 
that there is no true difference among the experimental treatments so far as the 
criterion is concerned. The specific purpose of most experiments is thus to 
test a “null” hypothesis. If we are forced to reject this hypothesis, we may 
then consider more complex hypotheses and plan further experiments to test 
these hypotheses more fully. For example, having shown by an initial 
experiment that there is some relationship between the criterion and the 
experimental variable, we may then plan a further experiment to test the 
hypothesis that this relationship is linear. If this hypothesis must be rejected,
we may next test the hypothesis that the relationship is parabolic, etc. Before
going on to more complex hypotheses, however, we want to make very sure
that the simple hypotheses are false, and that we are not following a “blind
lead” in our subsequent experiments.


1 The term is used here in a somewhat different meaning than is usual in the
mathematical theory of statistics.

We know that, even though the hypothesis to be tested is true, we cannot 
expect the observed effect in an experiment to agree exactly with the hypo- 
thetical true effect. Noting the discrepancy between the observed effect and 
the hypothetical true effect, we ask, is this discrepancy too large to be reason- 
ably attributed to “error,” — too large to enable us to retain the hypothesis? 
If so, just how confident may we be that the hypothesis is false? If the 
experiment has been properly designed, we can supply objective and quanti- 
tative answers to these questions. Thus a major objective of the design of an 
experiment is to make such answers possible.


The Essential Characteristics of a Good Experimental Design 


The essential characteristics of a good experimental design may now be 
summarized as follows: 


1) It will insure that the observed treatment effects are unbiased estimates 
of the true effects. 


2) It will permit a quantitative description of the precision of the observed 
treatment effects regarded as estimates of the “true” effects. 


3) It will insure that the observed treatment effects will have whatever de- 
gree of precision is required by the broader purposes of the experiment. 


4) It will make possible an objective test of a specific hypothesis concerning 
the true effects; that is, it will permit the computation of the relative 
frequency with which the observed discrepancy between observation 
and hypothesis would be exceeded if the hypothesis were true. 


5) It will be efficient; that is, it will satisfy these requirements at the 
minimum “cost,” broadly conceived. 


These are not the only essential characteristics of a good experiment. The 
usefulness or worthwhileness of an experiment is primarily dependent upon a 
great many other factors, which can be no more than barely identified in a 
book of this kind. In the earliest stages of planning any experiment, the 
problem to be investigated is usually stated in relatively indefinite and 
general terms. As the planning proceeds, the problem is modified and restated 
repeatedly, always more definitely and specifically, or always in a form more 
amenable to experimental attack — especially in view of the subjects, equip- 
ment, materials, and other resources available to the experimenter. Indeed, 
the final step in the planning is often to restate the problem once more so as to
make it fit the particular experimental design that seems feasible, rather than 
to make a final modification of the design to fit a final statement of the problem. 

The important decisions to be made in planning the experiment are con- 
cerned with: (1) the definition of the “treatments,” (2) the selection or exact 
definition of the population to be investigated, (3) the selection of a criterion, 
(4) the identification of the factors to be controlled and the level or levels at 
which each is to be controlled, (5) the final restatement of the problem, and 
(6) the selection of a specific experimental design. These decisions are
interdependent. A decision made at a particular stage in the planning may require
modifications in previous tentative decisions, which may in turn affect other 
previous decisions, etc. The selection of the experimental design is usually 
the last step taken, but, as already noted, even it may suggest desirable 
modifications in other decisions previously made. 


Basic Experimental Designs 


The majority of educational and psychological experiments in general are 
intended to determine the effects of certain treatments upon the mean value of 
a certain criterion variable for a specified population. It is with the designs 
employed in such experiments that this book is primarily concerned. 

Nearly all complex designs that are employed in experiments of this type 
may be regarded as variations or combinations of a small number of basic 
designs. It is possible, by selecting relatively simple and restricted examples, 
to illustrate concretely the application of each of these basic designs in terms 
of elementary statistical concepts already familiar to the student. This will 
be done in the remaining sections of this chapter, in order to give the student 
some advance appreciation of the nature and content of the text as a whole. 
The student is not expected to make any intensive study of these basic designs at
this point; a brief consideration of them is all that is needed for these intro- 
ductory purposes. The purpose of most of the remaining chapters will then 
be to generalize the application of these basic designs to less restricted situa- 
tions and show how they may be combined and modified to suit various 
experimental conditions and types of problems. The basic designs are as 
follows: 


1) Simple-Randomized Designs: Those in which each treatment is inde- 
pendently administered to a different sample of subjects, all samples 
independently drawn at random from the same parent population.


2) Treatments × Levels Designs (read “treatments by levels”): Those in
which the various treatments are administered to samples “matched”
with reference to a variable or variables related to the criterion and
therefore more alike in response to the treatments than simple random
samples would be.


3) Treatments × Subjects Designs (read “treatments by subjects”): Those
in which all treatments are successively administered to the same 
subjects. 


4) Random Replications Designs: Those in which the same basic experiment 
(of the simple-randomized type) is “replicated” (repeated) with independent
samples of subjects. The subjects for the various “replications”
may be drawn from the same population, or the experiment as a whole 
may be concerned with a population consisting of a large number of 
subpopulations, and the subjects for each replication may be drawn 
at random from a different and randomly selected subpopulation. In 
either case, the “replications” represented in the experiment may be re- 
garded as a random sample from a larger number of possible replications. 


5) Factorial Designs: Those in which there are two or more cross-classifications
of treatments, or in which the effects and interactions of two or
more experimental variables are simultaneously observed. 


6) Groups-Within-Treatments Designs: Those in which the population to be 
investigated consists of a large number of finite groups, and in which each 
treatment is administered to an independent random sample of intact 
groups. 


The essential nature of each of these designs will later be further clarified and 
the reasons for the names given them made more apparent by concrete 
illustrations. 


Basic Types of Error 


The observed differences among the treatment means in any experiment are 
due only partly to actual differences in the effectiveness of the treatments. 
They are also partly due to errors of various kinds, that is, to the effects of 
extraneous variables or factors. Some errors may vary from experiment to 
experiment; others may be constant over all experiments concerning the same 
treatments. The variable errors may be classed into three categories according
to the experimental units with which they are associated: subjects, treatment
groups, and replications. Errors which are constant for all replications or
throughout the experiment cannot be taken into consideration in any error 
analysis, their effects being inextricably intermingled with or inseparable from 
the treatments effect. 

It may be worth while to illustrate these three types of variable errors by 
referring to a concrete experiment. Suppose an experiment to determine the 
relative effectiveness of two methods (A₁ and A₂) of teaching fourth-grade
arithmetic is performed in a certain school. For the purposes of the
experiment, the pupils are divided into two groups; one is to receive Treatment or
Method A₁, the other Treatment A₂. One teacher is assigned to teach one
group by one method, another to teach the other group by the other method. 
The “treatments effect” is measured by the difference in the mean scores for 
the two groups on a criterion achievement test administered at the close 
of the experiment. 

Suppose that essentially the same experiment is performed independently in 
several different schools, the same treatments being administered under as 
nearly as possible the same conditions in each school. Suppose, also, that 
these schools are drawn from a certain population of schools, such as all 
public graded elementary schools in Iowa. Suppose, finally, that the object 
of these experiments individually and collectively is to estimate the difference 
in treatment means for the entire population of schools from which these 
particular schools are drawn, and to test the hypothesis that this difference is 
zero. In this situation, we may now illustrate the various types of error 
as follows: 

Type S Errors: In any single replication of such an experiment, it is usually 
left to chance to determine which treatment each subject is to receive. If the 
subjects are assigned at random to the treatment groups, the group assigned to 
A₁ may by chance contain a larger proportion of the more intelligent pupils,
or of those who like arithmetic, or of the more industrious pupils, or of the
pupils who received superior instruction during the year preceding, etc.
Accordingly, the mean criterion score may be higher for A₁ than for A₂, even
though the treatments are on the average equally effective for all pupils in 
general in the particular school involved. That part of an observed treatment 
effect which is thus due solely to the assignment of subjects to treatment groups 
will, for convenience in the subsequent discussion, be referred to as a “Type S” 
error. Type S errors are those which characterize simple random sam- 
pling. 

Type G Errors: In any single replication of an experiment, substantial 
differences would be found in the criterion means for the various treatments, 
even though no Type S errors were present, and even though the methods or 
treatments were on the average equally effective for the population sampled. 
Such differences may result from countless extraneous factors which tend to
have the same effect on all members of any one treatment group but a differ- 
ent constant effect on the members of any other treatment group, and which 
thus create systematic differences in the criterion means from group to group 
in the same replication. For example, in the illustration used, the group receiving
Treatment A₁ may have been assigned a better teacher, or a better
classroom, or a more favorable hour of the day for instruction, than the A₂
group. Again, the pupils receiving Treatment A₁ may inadvertently have
been given more time on the criterion test than those receiving Treatment A₂,
or some other accidental failure to administer the experiment properly may 
favor one treatment at the expense of the other in the comparison of means. 
The effect of most such factors can and should be randomized with reference 
to treatments in each replication independently; those that arise during the 
experiment and cannot be randomized will in most cases be accidental and
without bias. If this is the case, such errors will tend to cancel out in the long 
run (for a very large number of replications), but in any single replication 
they may have a pronounced effect on the observed differences in treatment 
means. 

In general, these errors are associated with the administration of the 
experiment or with the experimenter, in the sense that they are subject to 
the experimenter's control, although many of them (such as teacher differ- 
ences) are unavoidable and not the “fault” of the experimenter. As already 
noted, it is the experimenter's responsibility to reduce these errors to a 
minimum, and to randomize the effects of any factors that cannot be com- 
pletely equalized. For example, in the illustration used, after having done 
his best to secure equally good teachers for both methods, the experimenter 
may then flip a coin to determine which teacher is to be assigned to Method A₁
and which to A₂.

It might be well to draw attention here to a very important source of error
in educational and psychological experiments which has often gone un- 
recognized, or which, when recognized, has often been misclassified. In the 
illustrative situation, for example, suppose that among the pupils assigned 
to the experiment in this particular school, one pupil is a notorious trouble- 
maker, seriously interfering with the effectiveness of instruction in his classes. 
A source of variation of this kind is certainly associated with an individual 
pupil, but it is not independent of all other pupils. On the contrary, it exerts 
an important systematic effect on all pupils in one treatment group, and is 
therefore a Type G rather than a Type S error. The presence of a “natural 
leader” who exerts a beneficial influence on the members of his group, or the 
“esprit de corps” developing from the subtle influences of the pupils upon 
one another, also illustrate Type G errors. The importance of such factors in 
“group dynamics” is often overlooked in experiments where the treatments 
are administered on a group basis, or the assumption is wrongly made that 
the effects of such factors are taken into consideration by tests of significance 
based upon Type S errors only. 

Errors of the type here illustrated will, for convenience in the following 
discussion, be referred to as “Type G” errors. Type G errors may be defined 
as those due to the operation of extraneous factors which tend to have the 
same effect on all members of any given treatment group, but different effects 
on different treatment groups in any single replication. 

Type R Errors: In the illustrative situation, it is quite possible that Treatment
A₁ may actually be better than A₂ for certain schools in the given
population, but that A₁ may really be inferior to A₂ for certain other schools.
This could result from differences in curriculum, or in the administrative 
organization of the schools, or in school plant and equipment; or it could be due 
to any other conditions in the school or community making one method really 
more appropriate or effective than the other for that particular school or com- 
munity. The observed effect of a treatment in any particular school could 
then be free from error so far as that school alone is concerned, yet be
considerably in error as an estimate of the average treatment effect for all schools
in the given population of schools. 

There are many experiments in education and psychology which thus con- 
sist of a number of independent replications, each performed with samples 
drawn at random from a different sub-population in the total population with 
which the experiment as a whole is concerned. Variations in treatment 
effects from replication to replication, due neither to Type S nor Type G 
errors, but genuinely characteristic of the individual replications or
subpopulations, will be referred to in this discussion as “Type R” errors.¹
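
The way in which the three types of variable error combine in a single replication can be sketched in a short simulation. The Python below is an illustration added here, not a procedure from the text. The additive model, the base score of 50, and the standard deviations `sd_s`, `sd_g`, and `sd_r` are all hypothetical assumptions chosen only to make the three components visible.

```python
import random
from statistics import mean, pvariance


def observed_difference(rng, true_effect=0.0, sd_s=10.0, sd_g=2.0, sd_r=1.5, n=30):
    """Observed A1 - A2 difference in criterion means for one school (one replication).

    Assumed additive model: score = 50 + treatment effect + group error + subject error.
    """
    # Type R: the effect genuinely characteristic of this school departs from the
    # average effect for the whole population of schools.
    school_effect = true_effect + rng.gauss(0.0, sd_r)
    group_means = {}
    for treatment, effect in (("A1", school_effect), ("A2", 0.0)):
        # Type G: one draw shared by every member of this treatment group
        # (teacher, classroom, hour of the day, and so on).
        group_error = rng.gauss(0.0, sd_g)
        # Type S: an independent draw for each subject assigned to the group.
        scores = [50.0 + effect + group_error + rng.gauss(0.0, sd_s) for _ in range(n)]
        group_means[treatment] = mean(scores)
    return group_means["A1"] - group_means["A2"]


rng = random.Random(1)
differences = [observed_difference(rng) for _ in range(200)]
print(f"variance of observed differences over 200 schools: {pvariance(differences):.2f}")
```

Setting any one of the three standard deviations to zero removes the corresponding type of error from the simulated differences, which is a convenient way to see how much each source contributes under the assumed values.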


The Principle of Randomization 


It is never possible to eliminate or to equalize completely the effects of any 
type of error, but under certain circumstances any bias in the treatment effect 
resulting from uncontrolled error variations may be successfully eliminated by 
randomizing the error variations with reference to the treatments. In the 
illustration just used, for example, a strictly random procedure may be fol- 
lowed in assigning the pupils to the treatment groups. While certain group 
differences (Type G errors) may be unavoidable, their effects may nevertheless 
be randomized. For example, it may not be possible or practicable to secure 
equally good teachers or to use the same teacher for both treatments; but it 
may be possible to leave entirely to chance the assignment of teachers to 
treatments. Similarly, the assignment of classrooms, periods, equipment, 
etc., may be done on a strictly random basis. Finally, if the experiment is to 
be replicated in several different schools, care may be taken to select a strictly 
random sample of schools from the population of schools involved, so that 
there will be no systematic tendency to select a disproportionately large 
number of schools in which one treatment is actually superior to the other. 
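
The random procedures just described may be sketched in a few lines of Python. This is an illustrative sketch only: the function name `randomize_experiment`, the pupil numbers, and the teacher labels are hypothetical, and the pseudo-random generator stands in for the coin or table of random numbers an experimenter would actually use.

```python
import random


def randomize_experiment(pupils, teachers, rng):
    """Leave to chance the assignment of pupils, teachers, and methods to two groups."""
    pupils = list(pupils)
    rng.shuffle(pupils)  # randomizes Type S errors
    half = len(pupils) // 2
    groups = [pupils[:half], pupils[half:]]

    teachers = list(teachers)
    rng.shuffle(teachers)  # randomizes one source of Type G error

    methods = ["A1", "A2"]
    rng.shuffle(methods)  # equivalent to flipping a coin for the two methods

    return [
        {"method": m, "teacher": t, "pupils": g}
        for m, t, g in zip(methods, teachers, groups)
    ]


plan = randomize_experiment(range(1, 61), ["first teacher", "second teacher"], random.Random())
for group in plan:
    print(group["method"], group["teacher"], len(group["pupils"]), "pupils")
```

Classrooms, periods, and equipment could be assigned in the same manner, each with its own independent shuffle.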

There is, however, a still more important reason for randomizing error 
variations with reference to treatments in each replication. All tests of 
statistical significance and all standard error formulas are based on the 
fundamental laws of probability. Cognizance can be taken of a certain 
source of error in a statistical test of significance only if it has been left 
entirely to chance which treatment will benefit from this source in any single 
comparison of the treatments. Furthermore, a number of such comparisons 
must be made, in each of which the effect of the given source of error has 
been independently randomized with reference to the treatments. The 
usual “error estimate” is an estimate of the population variance of errors 
due to certain sources, and at least two observations, of course, are essential 
to the computation of a variance estimate. One of the most important and
basic of all principles of experimental design is thus the principle of randomization.
Briefly restated, this is the principle that a given type of error can be
eliminated as a source of bias and can be taken into consideration in an error
estimate or a test of significance only if a number of independent observations
of the effect of this type of error have been obtained, and only if these observations
may be regarded as a random sample from all such observations possible.


1 In later chapters these Type R errors will often be referred to as “interaction”
effects, or, more specifically, as the intrinsic treatments × replications interaction
effects (see pages 193-194). The extrinsic treatments × replications interaction effects
constitute the Type G errors.

It should be noted that each of the three basic types of error must be 
randomized in an experiment if each is to be taken into consideration in a 
valid test of significance. Subjects must be randomly assigned to the treat- 
ment groups; Type G errors must be independently randomized for each 
replication; and the subpopulations with which the individual replications are 
performed must be selected at random from all subpopulations constituting 
the population involved. With some designs, as we shall see, it is possible to 
randomize certain but not all of these basic types of error. Thus the combina- 
tion of error types for which the error estimate or test of significance is valid 
may differ from one design to another. Sometimes it is possible to randomize 
both the Type S and the Type G errors but not Type R errors. In other 
situations, it may be possible to randomize some of the Type G errors but not 
others. Whenever possible, of course, the experimenter should use a design 
permitting a test of significance that will be valid for all types of error affecting 
the observed result. That is, the experimenter should know how the treat- 
ment effects would fluctuate in similar experiments as the result of the com- 
bined effect of all sources of error. The sampling distribution involved in the 
test of significance should be the joint error distribution. 


Illustrative Applications of Basic Designs 


In terms of principles and concepts thus far considered, we are now ready 
to consider a specific illustration of each of the basic experimental designs 
listed on pages 7 and 8. Again it should be emphasized that the student is 
not expected at this point to make an intensive or exhaustive study of these 
designs. Two or three careful readings of these descriptions should suffice 
for the introductory purposes intended. 

Simple-Randomized Designs: Consider again an experiment like that already 
described, in which the “treatments” are two methods of teaching a given 
unit of school instruction in arithmetic. The population involved consists 
of “all fourth-grade pupils in the public schools of Iowa”; the criterion is 
the score on an objective achievement test designed for the given unit of 
instruction. Suppose that, in a certain school, 60 pupils are available for
the experiment, and that these may be regarded as a random sample from all 
pupils who might be taught fourth-grade arithmetic in that school. Suppose 
that these pupils are randomly assigned, 30 to one group and 30 to another; 
that all administrative arrangements are made for these groups, including the 
assignment of teachers, classrooms, periods, etc.; and that as a final step in
the arrangement, a coin is flipped to determine which group is to be taught by 
Method A₁ and which by Method A₂. At the close of the period of instruction,
both groups are given the criterion test, and the mean scores ($M_{A_1}$ and $M_{A_2}$)
of the two groups are computed and compared. The estimated standard
error of the difference in these means is computed by the formula

$$\text{est'd } \sigma_{(M_{A_1} - M_{A_2})} = \sqrt{\frac{\sum d_{A_1}^2 + \sum d_{A_2}^2}{n_{A_1} + n_{A_2} - 2}} \; \sqrt{\frac{1}{n_{A_1}} + \frac{1}{n_{A_2}}}$$

in which $d_{A_1}$ and $d_{A_2}$ represent individual deviations from $M_{A_1}$ and $M_{A_2}$
respectively, and $n_{A_1}$ and $n_{A_2}$ represent the corresponding numbers of cases.
The significance of the difference is then tested by

$$t = \frac{M_{A_1} - M_{A_2}}{\text{est'd } \sigma_{(M_{A_1} - M_{A_2})}}$$

for which the number of degrees of freedom is $(n_{A_1} + n_{A_2} - 2)$.
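
The computation just described can be sketched in Python as follows. The function name `pooled_t` and the two short lists of scores are hypothetical and serve only to show the arithmetic; they are not data from any experiment.

```python
import math
from statistics import mean


def pooled_t(scores_a1, scores_a2):
    """t for the difference between two independent group means, with its degrees of freedom."""
    n1, n2 = len(scores_a1), len(scores_a2)
    m1, m2 = mean(scores_a1), mean(scores_a2)
    # Sums of squared deviations of the individual scores from their own group mean.
    ss1 = sum((x - m1) ** 2 for x in scores_a1)
    ss2 = sum((x - m2) ** 2 for x in scores_a2)
    df = n1 + n2 - 2
    # Estimated standard error of the difference in means.
    se_diff = math.sqrt((ss1 + ss2) / df) * math.sqrt(1 / n1 + 1 / n2)
    return (m1 - m2) / se_diff, df


# Hypothetical criterion scores for two small groups.
t, df = pooled_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(f"t = {t:.3f} with {df} degrees of freedom")
```

The value of t so obtained would then be referred to a table of t for the indicated degrees of freedom.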

It is immediately apparent that this test of significance takes into considera- 
tion Type S errors only. Most of the irrelevant factors resulting in Type G 
errors (such as teachers, classrooms, etc.) may have been randomized in the 
assignment of treatments to groups, but since we have only one observation of 
a difference in group means (that is, since we have only one observation 
containing a Type G error) we cannot estimate the variability of such errors. 
For the same reason we cannot estimate the variability from replication to 
replication (school to school) of treatment differences which are genuinely 
characteristic of the replications. That is, we cannot take Type R errors into 
consideration in the test of significance. Furthermore, this design only ran- 
domizes, but does not control or equalize, subject variations, except insofar as 
they are equalized by the random assignment of subjects to groups. Since the 
subjects involved in educational and psychological experiments are typically 
characterized by very large individual differences, it is necessary when using 
this design to employ relatively large samples in order to secure the desired 
degree of precision. For these reasons, the simple-randomized design finds 
relatively few applications in educational and psychological research, except 
as a unit in more complex designs. 

The illustration of the simple-randomized design just presented involves 
only two treatments. When comparisons are to be made simultaneously 
among several treatments, a different analytical procedure and a different test 
of significance must be employed, but the possibilities and limitations of the 
design are otherwise unchanged. The manner in which the more generalized 
case is handled will be considered in Chapter 3. 

Treatments × Levels Designs: The treatments × levels design provides for
direct control of inter-subject variations. In this design, the treatments are 
administered to samples that have been “matched” with reference to a 
“control” variable or variables. Consider again the experiment in teaching 
fourth-grade arithmetic. Suppose that, on the basis of their achievement in
third-grade arithmetic, the pupils had previously been rated as “superior,” 
“average,” or “inferior.” Suppose then that the pupils from each of these 
three subgroups are independently assigned at random to the two experi- 
mental treatments. The two treatment groups would then be “matched” 
with reference to third-grade achievement, in that each treatment group 
contains the same proportion of subjects at each level. 
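
The matching procedure described above amounts to an independent random assignment within each level. A minimal Python sketch follows. The function name `assign_within_levels` and the pupil labels are hypothetical, and twenty pupils per level are assumed so that each of the six subgroups receives ten.

```python
import random


def assign_within_levels(pupils_by_level, rng):
    """Within each level, assign half of the pupils at random to A1 and the rest to A2."""
    assignment = {"A1": {}, "A2": {}}
    for level, pupils in pupils_by_level.items():
        shuffled = list(pupils)
        rng.shuffle(shuffled)  # an independent randomization for each level
        half = len(shuffled) // 2
        assignment["A1"][level] = shuffled[:half]
        assignment["A2"][level] = shuffled[half:]
    return assignment


# Hypothetical roster: twenty pupils rated at each of the three levels.
roster = {
    level: [f"{level}-{i}" for i in range(1, 21)]
    for level in ("superior", "average", "inferior")
}
groups = assign_within_levels(roster, random.Random())
for treatment, levels in groups.items():
    print(treatment, {level: len(members) for level, members in levels.items()})
```

Because the split is made separately at each level, both treatment groups necessarily contain the same proportion of subjects at each level.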

This illustrative experiment may be diagrammed as follows: 


Level                  Treatment
                    A₁          A₂

L₁: Superior
L₂: Average
L₃: Inferior

                 $M_{A_1}$   $M_{A_2}$


For the sake of simplicity of illustration, we have let the number of cases 
(n = 10) be the same in each of the six subgroups. Accordingly, the over-all 
mean for Treatment A₁ is

$$M_{A_1} = \tfrac{1}{3}\,(M_{A_1L_1} + M_{A_1L_2} + M_{A_1L_3}).$$


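Now we know from elementary statistics¹ that the variance of the sum of a
number of unrelated variables is equal to the sum of their variances, and that
the variance of k times a given variable is equal to k² times the variance of
that variable. From this it follows that the estimated error variance of $M_{A_1}$
(k being equal to 1/3) is

$$\begin{aligned}
\text{est'd } \sigma^2_{M_{A_1}} &= \tfrac{1}{9}\left(\text{est'd } \sigma^2_{M_{A_1L_1}} + \text{est'd } \sigma^2_{M_{A_1L_2}} + \text{est'd } \sigma^2_{M_{A_1L_3}}\right) \\
&= \tfrac{1}{9}\left(\frac{\sum d^2_{A_1L_1}}{10 \times 9} + \frac{\sum d^2_{A_1L_2}}{10 \times 9} + \frac{\sum d^2_{A_1L_3}}{10 \times 9}\right) \\
&= \tfrac{1}{810}\left(\sum d^2_{A_1L_1} + \sum d^2_{A_1L_2} + \sum d^2_{A_1L_3}\right)
\end{aligned}$$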


in which $d_{A_1L_1}$ is a deviation from $M_{A_1L_1}$ (in the superior group for A₁), and
$d_{A_1L_2}$ and $d_{A_1L_3}$ have similar meanings. The estimated error variance of $M_{A_2}$
may be similarly computed. Accordingly, the estimated error variance for
the difference in over-all treatment means is

$$\text{est'd } \sigma^2_{(M_{A_1} - M_{A_2})} = \text{est'd } \sigma^2_{M_{A_1}} + \text{est'd } \sigma^2_{M_{A_2}}$$


and the test of the significance of this difference is

$$t = \frac{M_{A_1} - M_{A_2}}{\sqrt{\text{est'd } \sigma^2_{(M_{A_1} - M_{A_2})}}}$$

for which the number of degrees of freedom is 54, there being 9 degrees of
freedom for the error variance of the mean of each of the six groups involved.
(If the concept of degrees of freedom is not already understood by the student,
he may postpone any consideration of it to Chapter 2. For present purposes,
the indicated degrees of freedom may be taken for granted.)

¹ See Peters and Van Voorhis, Statistical Procedures and Their Mathematical Bases
(New York: McGraw-Hill Book Company, Inc., 1940), pp. 77, 177.
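The computation just described can be sketched in a few lines of Python. This is only an illustration under the assumptions of the example (the same number of cases in every subgroup); the function names and the scores are hypothetical and are not taken from the text.

```python
import math

def mean(scores):
    return sum(scores) / len(scores)

def est_var_of_mean(scores):
    """Estimated error variance of a sample mean: sum(d^2) / (n * (n - 1))."""
    n = len(scores)
    m = mean(scores)
    return sum((x - m) ** 2 for x in scores) / (n * (n - 1))

def treatments_by_levels_t(levels_a1, levels_a2):
    """t for the over-all treatment difference.

    Each argument is a list of per-level score lists; every subgroup is
    assumed to contain the same number of cases.
    """
    k = len(levels_a1)  # number of levels
    m_a1 = sum(mean(g) for g in levels_a1) / k
    m_a2 = sum(mean(g) for g in levels_a2) / k
    # Variance of (1/k) * (sum of k independent means) = (1/k^2) * (sum of variances)
    var_a1 = sum(est_var_of_mean(g) for g in levels_a1) / k ** 2
    var_a2 = sum(est_var_of_mean(g) for g in levels_a2) / k ** 2
    t = (m_a1 - m_a2) / math.sqrt(var_a1 + var_a2)
    df = sum(len(g) - 1 for g in levels_a1 + levels_a2)
    return t, df

# Hypothetical scores: three levels, n = 10 in each of the six subgroups.
a1 = [[base + i for i in range(10)] for base in (30, 20, 10)]
a2 = [[base + i for i in range(10)] for base in (28, 18, 8)]
t, df = treatments_by_levels_t(a1, a2)
print(round(t, 3), df)
```

With six subgroups of ten cases each, the function returns 54 degrees of freedom, in agreement with the count given above.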

The principal advantage of this design over the simple-randomized design 
is that it provides a direct control of Type S errors. If there is a substantial 
correlation between the control and the criterion variables, the treatment 
groups, having been made closely alike with reference to the control variable, 
will be much more alike with reference to the criterion variable than simple 
random samples would be. Differences in the criterion means due to random 
assignment of subjects to treatment groups will thus be considerably reduced, 
depending upon the degree of correlation between the control and the criterion 
variables. 

Another advantage of the treatments X levels design is that it permits
a separate study of the treatment effects at different levels of the control
variable. In the illustrative experiment, for example, Treatment $A_1$ might
be shown to be superior to $A_2$ for pupils of superior ability, but inferior to
$A_2$ for pupils of inferior ability. To test the hypothesis that the treatments
are equally effective at the upper and lower levels, one could use the t-test

$$t = \frac{(M_{A_1L_1} - M_{A_2L_1}) - (M_{A_1L_3} - M_{A_2L_3})}{\sqrt{\text{est'd } \sigma^2_{(M_{A_1L_1} - M_{A_2L_1})} + \text{est'd } \sigma^2_{(M_{A_1L_3} - M_{A_2L_3})}}}$$

in which

$$\text{est'd } \sigma^2_{(M_{A_1L_1} - M_{A_2L_1})} = \text{est'd } \sigma^2_{M_{A_1L_1}} + \text{est'd } \sigma^2_{M_{A_2L_1}}$$

(and similarly for the lower level), and for which the degrees of freedom is
the sum of the degrees of freedom for the standard errors of the means involved.
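A minimal sketch of this second t-test follows. The function name and the scores are hypothetical; the only ingredients taken from the text are the difference between the two simple effects, the sum of the four estimated error variances, and the rule for the degrees of freedom.

```python
import math

def mean(scores):
    return sum(scores) / len(scores)

def est_var_of_mean(scores):
    """Estimated error variance of a sample mean: sum(d^2) / (n * (n - 1))."""
    n = len(scores)
    m = mean(scores)
    return sum((x - m) ** 2 for x in scores) / (n * (n - 1))

def upper_vs_lower_t(a1_upper, a2_upper, a1_lower, a2_lower):
    """t for the difference between the simple effects at two levels."""
    groups = (a1_upper, a2_upper, a1_lower, a2_lower)
    simple_upper = mean(a1_upper) - mean(a2_upper)
    simple_lower = mean(a1_lower) - mean(a2_lower)
    # The four subgroup means are independent, so their variances add.
    error_var = sum(est_var_of_mean(g) for g in groups)
    t = (simple_upper - simple_lower) / math.sqrt(error_var)
    df = sum(len(g) - 1 for g in groups)
    return t, df

# Hypothetical scores: A1 better at the upper level, A2 better at the lower.
a1_upper = [34 + i for i in range(10)]
a2_upper = [30 + i for i in range(10)]
a1_lower = [8 + i for i in range(10)]
a2_lower = [12 + i for i in range(10)]
t, df = upper_vs_lower_t(a1_upper, a2_upper, a1_lower, a2_lower)
print(round(t, 3), df)
```

In these made-up data the two simple effects are +4 and -4, so the over-all difference between treatments at these two levels is zero while the difference between the simple effects is large.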

It is conceivable that Treatment $A_1$ is superior to $A_2$ at the upper level
and inferior to $A_2$ at the lower level, but that for the upper and lower levels
combined the two treatments are on the average equally effective. In that
case, the t for $(M_{A_1} - M_{A_2})$ might prove non-significant and that for
$[(M_{A_1L_1} - M_{A_2L_1}) - (M_{A_1L_3} - M_{A_2L_3})]$ significant. The reverse could also
be true; that is, one treatment might be superior to the other but equally
so at both levels.

The treatments effect for a given level of the control variable is known 
as the “simple” effect of the treatments at that level. The weighted average 
of the simple effects for all levels of the control variable is known as the 
“main” effect of the treatments. In the illustration, the simple effect of
the treatments for the top level of the control variable is measured by
$M_{A_1L_1} - M_{A_2L_1}$; the main effect is measured by $M_{A_1} - M_{A_2}$, which in this
case is the simple average of the simple effects.
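That the main effect is here the simple average of the simple effects may be verified directly, assuming (as in the illustration) the same number of cases in every subgroup:

$$M_{A_1} - M_{A_2} = \tfrac{1}{3}\sum_{i=1}^{3} M_{A_1L_i} - \tfrac{1}{3}\sum_{i=1}^{3} M_{A_2L_i} = \tfrac{1}{3}\sum_{i=1}^{3}\left(M_{A_1L_i} - M_{A_2L_i}\right)$$

With unequal numbers of cases at the various levels, the simple effects would no longer enter with equal weight.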

If the second of the preceding t's proves significant, that is, if the simple 
effect is shown to differ from one level to another, we would conclude that 
there is an "interaction" between treatments and levels, or that the relative 
effectiveness of the treatments depends upon the level at which they are used. 
Such information could obviously be more valuable than that obtained from 
the test of the significance of the over-all difference. 

For the sake of simplicity, only three levels were used in this illustration. 
In actual practice, a larger number of levels would ordinarily be employed
to insure a closer matching of the treatment groups. In many instances, 
the matching would be based on quantitative measures, such as test scores, 
rather than on categorical ratings. The distribution of such scores could 
of course be divided into any desired number of intervals or levels, and the 
number of cases might differ from one level to another. In the general case, 
there would also be more than two treatments. In the special case here con- 
sidered, it is possible to evaluate the results quite satisfactorily with simple 
t-tests, although the second t-test is concerned with only two levels, whereas 
a test of interaction should properly involve all levels. Other analytical 
procedures and tests of significance are required for the general case, and 
these will be presented later (Chapter 5). The basic limitations and possi- 
bilities of the general design, however, are adequately exemplified in this 
illustration. 

In using this design, the different “levels” need not correspond to different 
scale intervals for a single continuous variable. The levels might correspond 
to non-ordered categories in any classification applicable to the members 
of the population involved, such as sex, religious preference, nationality, 
geographical location, etc. The subjects would then be randomized with 
reference to treatments for each level independently, but the treatments 
would be administered on a group basis to all levels simultaneously. 

The name treatments X levels is ordinarily applied to a design only if 
the levels have been introduced in order to increase the precision of the 
experiment and for no other reason. However, essentially the same design 
can be employed even though the introduction of levels results in no appre- 
ciable increase in precision, the purpose being to permit a study of the simple 
effects and the differences among them (interaction) as well as of the main 
effects of the treatments. In this case, as we shall see later, the design would 
be termed a "factorial" design. The treatments X levels design is sometimes
difficult to distinguish from a factorial design; the problem of this
distinction will be considered later in the discussion of factorial designs. 

Aside from the advantages here considered, the advantages and disad- 
vantages of the treatments X levels design are the same as those of the 
simple-randomized design. In both designs, Type G and Type R (if any) 
errors are confused with the true treatment effects, and the available test of 
significance takes into consideration Type S errors only. 




Treatments X Subjects Designs: If the treatments are such that all can
be administered in sequence to the same subjects, and if the effects of each
treatment are uninfluenced by the fact that other treatments have previously
been administered to the same subjects, it is possible to eliminate entirely
the influence of inter-subject differences upon the treatments effect. Suppose
that an experiment is intended to determine the effect of the administration
of two drugs ($A_1$ and $A_2$) on the time required to perform a certain task, such
as the time required to cross out every letter “e” on a given printed page.
Suppose that each of a group of n subjects, selected at random from a specified
population, performs this task twice, once under the influence of Drug $A_1$
and once under the influence of Drug $A_2$. We will assume that the task has
previously been performed often enough by the subjects so that there is no
further improvement due to practice, and that the influence of Drug $A_1$ has
dissipated entirely before Drug $A_2$ is administered. For each subject, a difference
can be found between the two obtained times, a difference in favor
of Drug $A_1$ being considered as positive and one in favor of $A_2$ as negative.
The mean (M) and standard deviation (s) of the distribution of these differences
can then be computed, and a simple t-test can be employed to test the
hypothesis that the population value of the mean difference is zero, as follows:

$$t = \frac{M}{\text{est'd } \sigma_M} = \frac{M}{s / \sqrt{n - 1}}$$


Since exactly the same subjects take both treatments, no part of the 
difference in treatment means can be attributed to differences among the 
subjects (inter-subject differences), although chance errors of measurement 
(intra-subject differences) might still favor one treatment or the other. 
Because inter-subject differences are usually a major source of error in edu- 
cational and psychological experiments, the treatments X subjects design is 
usually far more precise than the simple-randomized or the treatments X 
levels design, granting that a fairly reliable criterion measure is employed. 
However, the usefulness of this design is severely limited by the fact that the 
effect of a given treatment is usually not independent of or unaffected by 
the previous administration of another treatment to the same subjects. 
(More complex designs will be considered later in which the effects of order 
and sequence of treatments are counter-balanced.) Furthermore, the use 
of this design usually requires that equivalent forms of a criterion test be 
available, so as to eliminate or render negligible the practice effect of taking 
the same test more than once. 
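The test on the differences can be sketched as follows. The times are hypothetical, and two assumptions are made that should be noted: a shorter time is taken as "in favor of" a drug, so each difference is computed as the time under $A_2$ minus the time under $A_1$; and the error variance of the mean difference is estimated as the sum of squared deviations divided by n(n - 1), the form used earlier for the mean of a random sample.

```python
import math

def treatments_by_subjects_t(times_a1, times_a2):
    """t for the mean of the per-subject differences (same subjects, both drugs)."""
    # Assumed sign convention: positive difference = shorter time under A1.
    diffs = [t2 - t1 for t1, t2 in zip(times_a1, times_a2)]
    n = len(diffs)
    m = sum(diffs) / n
    est_var_of_m = sum((d - m) ** 2 for d in diffs) / (n * (n - 1))
    return m / math.sqrt(est_var_of_m), n - 1

# Hypothetical crossing-out times (seconds) for five subjects.
times_a1 = [50, 48, 53, 47, 52]
times_a2 = [54, 50, 56, 52, 55]
t, df = treatments_by_subjects_t(times_a1, times_a2)
print(round(t, 3), df)
```

Note that only the differences enter the computation; the large differences among the subjects themselves (47 to 53 seconds under $A_1$) do not affect the error term.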

Aside from the advantages and disadvantages just noted, the treatments 
X subjects design has all the limitations of the simple-randomized and treat- 
ments X levels designs. In none of these designs does the test of significance 
take into consideration Type G or Type R errors. 

The Random Replications Design: We have seen that all the designs thus 
far considered have the common limitation that the available test of signifi-
cance does not take Type G or Type R errors into consideration. Thus it is
impossible to tell to what extent the observed treatment effect is due to 
such errors and to what extent to real differences among the treatments in 
the entire population considered. 

We shall consider first a design which takes Type G (but not Type R) 
errors into consideration. To accomplish this, a number of different obser- 
vations must be made of the treatment effects, Type G errors having been 
independently randomized for each observation. Suppose, for instance, that 
a methods experiment in arithmetic is to be performed with the pupils in a 
given school, and that 150 pupils are available for the experiment. (It is 
important to note that these 150 pupils must, for the purposes of this design, 
be regarded as a random sample from a hypothetical larger population of all 
pupils who might take fourth-grade arithmetic in this school under the gen- 
eral conditions now prevailing in this school.) Suppose then, that instead 
of conducting a single experiment with two random samples of, say, 15 pupils 
each, five separate experiments or replications are performed, the pupils
having been randomly assigned to ten different groups for the purpose. 
Type G errors are independently randomized in each replication, and each is 
independently administered under as nearly as possible the same conditions. 
This experiment can be diagrammed as follows: 


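                        Treatments
                    $A_1$            $A_2$

Replication #1      $M_{A_1R_1}$     $M_{A_2R_1}$     $D_1 = (M_{A_1R_1} - M_{A_2R_1})$
Replication #2      $M_{A_1R_2}$     $M_{A_2R_2}$     $D_2$
Replication #3      $M_{A_1R_3}$     $M_{A_2R_3}$     $D_3$
Replication #4      $M_{A_1R_4}$     $M_{A_2R_4}$     $D_4$
Replication #5      $M_{A_1R_5}$     $M_{A_2R_5}$     $D_5$

                    $M_{A_1}$        $M_{A_2}$        $\bar{D} = (M_{A_1} - M_{A_2})$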


Since Type G errors have been independently randomized for each replica- 
tion, we now have five observations of treatment effects (five D’s) containing 
such errors. These five replications may be regarded as randomly selected 
from all possible replications of this kind in the hypothetical population re- 
ferred to earlier. Accordingly, the Type G errors contained in these five 
observations may be regarded as a simple random sample from a hypothetical 
population of such errors for an indefinite number of such replications. The 
five observed D's, then, are a simple random sample from a hypothetical 
population of such D's. Accordingly, the estimated error variance of the
mean difference ($\bar{D}$) is given by

$$\text{est'd } \sigma^2_{\bar{D}} = \frac{\sum (D - \bar{D})^2}{n(n - 1)}$$

in which, in this case, n = 5.


To test the hypothesis that the true mean of these differences (that for
the hypothetical population described) is zero, we may use the simple t-test

$$t = \frac{\bar{D}}{\sqrt{\text{est'd } \sigma^2_{\bar{D}}}}$$

for which the number of degrees of freedom is n - 1 (in this case, 4).
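A brief sketch of this test is given below, with hypothetical scores and a hypothetical function name. The essential point it illustrates is that each replication contributes a single observation (its D), so that n is the number of replications and not the number of pupils.

```python
import math

def random_replications_t(replications):
    """t for the mean of the replication differences.

    replications: list of (scores_a1, scores_a2) pairs, one pair per replication.
    """
    diffs = [sum(a1) / len(a1) - sum(a2) / len(a2) for a1, a2 in replications]
    n = len(diffs)  # number of replications, not number of pupils
    d_bar = sum(diffs) / n
    est_var = sum((d - d_bar) ** 2 for d in diffs) / (n * (n - 1))
    return d_bar / math.sqrt(est_var), n - 1

# Hypothetical data: five replications, three pupils per group for brevity.
replications = [
    ([12, 14, 16], [10, 12, 14]),
    ([11, 13, 15], [10, 12, 14]),
    ([13, 15, 17], [10, 12, 14]),
    ([10, 12, 14], [10, 12, 14]),
    ([14, 16, 18], [10, 12, 14]),
]
t, df = random_replications_t(replications)
print(round(t, 3), df)
```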

If the purpose of this experiment is to determine the relative effectiveness 
of the methods for this particular school only, then there is no Type R error 
present, since all replications are performed with samples drawn at random 
from the same population. The t-test of significance just described is then
valid for all types of errors that have been randomized for the various replications. 

However, if the purpose of this illustrative experiment is to establish 
the relative effectiveness of the methods for a population of schools of which 
this school is a member, then of course a Type R error may be present. How- 
ever, it will be constant for all replications within this school, and therefore 
will not be taken into consideration in the test of significance. 

If the various replications are thus performed for random samples drawn 
from the same population, in which case there are no Type R errors present, 
the design will be referred to as a "simple" random replications design. 

The random replications design in general differs from the “simple” random 
replications design in that, in the general case, the experiment as a whole 
is concerned with a total population consisting of a large number of sub- 
populations, and that each replication is performed for a different subpopula- 
tion, selected at random from the total population. In each replication 
independently (as in the case of simple replications), all systematic differ- 
ences among the treatment groups, whether due to differences among sub- 
jects or to differences in experimental conditions, are randomized with refer- 
ence to the treatments. 

To illustrate the general case of random replications, suppose that a methods 
experiment in arithmetic is intended to determine the relative effectiveness 
of the methods for “all graded public elementary schools in Iowa," and 
that five of these schools are selected at random. Suppose that within each 
of these schools 30 pupils are selected at random from all fourth-grade pupils, 
and that 15 of these pupils are assigned, either arbitrarily or at random, to
each of two treatment groups. Suppose also that, after all administrative
arrangements have been made for these treatment groups, the treatments 
are finally assigned at random to these groups. This experiment may then 
be diagrammed exactly as in the preceding illustration, and the test of sig- 
nificance similarly computed and applied. 

In the design just illustrated, all three of the basic types of error may 
be taken into consideration in the test of significance. In this case, there 
would be a different Type R error in each replication, and the five errors of 
this type represented in the five replications could be regarded as a simple 
random sample from the population of such errors for all schools in the given 
population. Thus, the variance of the distribution of five differences would
be due in part to the variance of errors due to subject differences, in part 
to the variance of Type G errors, and in part to the variance of Type R 
errors. Since in every case these errors are a random sample from a hypo- 
thetical population of such errors, the test of significance takes all types of 
errors into consideration simultaneously. The only errors which are not taken
into consideration in this test are those which have not been randomized 
independently in each replication, or which are constant for all replications. 

Since Type G and Type R errors may be relatively very important in many 
educational and psychological experiments, it is apparent that the random 
replications design represents a very marked improvement over the designs 
considered earlier. 

Factorial Designs: Traditionally, the “ideal” experiment has been regarded 
as one concerned with only a single experimental variable, and one in which 
an attempt is made to “hold constant" the effects of all concomitant or 
extraneous variables. We have already seen, however, that instead of hold- 
ing a concomitant variable constant at only one level, it might be better 
to replicate the experiment at several different levels of that variable. This 
is desirable, not only in order to increase the precision of the experiment, 
but also in order that the interaction (if any) between treatments and levels 
may be studied, and in order that the relative effectiveness of the treatments 
at each of a number of different levels may be determined simultaneously in 
a single experiment. We have seen, also, that the “levels” involved in the 
treatments X levels design need not correspond to different degrees or amounts 
of a single continuous variable, but may represent different non-ordered 
categories in any classification of the subjects. The interest might conceiv- 
ably be greater in the interaction effect than in the main effect of the treat- 
ments. From this it is only a small step to the design in which two or more 
experimental variables may be studied simultaneously in the same experi- 
ment, or in which comparisons may be made simultaneously within each of 
a number of (cross) classifications of treatments. Suppose, for example, 
that an investigation is being made to identify the factors which determine 
the rate of reading a certain type of material at a certain level of
comprehension. Traditionally, the procedure would have been to plan a number of
independent experiments of the single variable type, one concerned only with 
size of type, another with style of type, another with length of line (width of 
column), etc. In the size-of-type experiment, the factors of style of type 
and length of line would be held constant; in the style-of-type experiment 
the factors of size of type and length of line would be held constant, etc. 

A much better procedure, however, might be to vary all these factors in 
a single experiment so as not only to accomplish the purposes of the afore- 
mentioned single-variable experiments, but also to study the possible inter- 
actions among the various factors. Let us consider a specific illustration 
of such an experiment, in which, for the sake of simplicity, only two factors 
(size and style of type) are involved and only two levels of each factor: Roman 
vs. Clarendon styles and eight-point vs. twelve-point sizes of type. We will
let A represent the style factor and B the size factor. $A_1$ will represent
Roman type, $A_2$ Clarendon type, $B_1$ eight-point and $B_2$ twelve-point type.
Thus treatment $A_1B_2$ represents twelve-point Roman and $A_2B_1$ represents
eight-point Clarendon, etc. Suppose that the same rate-of-reading test is
printed in four editions: eight-point Roman, twelve-point Roman, eight-point
Clarendon, and twelve-point Clarendon. The experiment would thus be
concerned with four treatments, corresponding to these four size-style com- 
binations. The available subjects, say 100, may then be randomly divided 
into four equal treatment groups, each taking a different one of these four 
editions of the rate-of-reading test. To insure that all are tested under the 
same conditions, the test might be simultaneously administered to all sub- 
jects in a single group, one-fourth taking one edition, another fourth taking 
another edition, etc. The experiment might then be diagrammed as follows: 
                          Style
                    $A_1$            $A_2$

Size    $B_1$       $M_{A_1B_1}$     $M_{A_2B_1}$     $M_{B_1}$
        $B_2$       $M_{A_1B_2}$     $M_{A_2B_2}$     $M_{B_2}$

                    $M_{A_1}$        $M_{A_2}$

The “main” effect of style, that is, its average effect for both sizes of type,
would be measured by $(M_{A_1} - M_{A_2})$, and the “main” effect of size by
$(M_{B_1} - M_{B_2})$. The “simple” effects of style (the effects for each size
separately) would then be measured by $(M_{A_1B_1} - M_{A_2B_1})$ and $(M_{A_1B_2} - M_{A_2B_2})$;
and the “simple” effects of size would similarly be measured by
$(M_{A_1B_1} - M_{A_1B_2})$ and $(M_{A_2B_1} - M_{A_2B_2})$. Finally, the “interaction” effect of
size X style (read “size by style”) would be measured by

$$[(M_{A_1B_1} - M_{A_2B_1}) - (M_{A_1B_2} - M_{A_2B_2})].$$

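In this restricted example, a simple t-test could be used to test the
significance of any of these "effects." For instance, the main effect of size would
be tested by

$$t = \frac{M_{B_1} - M_{B_2}}{\sqrt{\text{est'd } \sigma^2_{M_{B_1}} + \text{est'd } \sigma^2_{M_{B_2}}}}
= \frac{M_{B_1} - M_{B_2}}{\tfrac{1}{2}\sqrt{\left(\text{est'd } \sigma^2_{M_{A_1B_1}} + \text{est'd } \sigma^2_{M_{A_2B_1}}\right) + \left(\text{est'd } \sigma^2_{M_{A_1B_2}} + \text{est'd } \sigma^2_{M_{A_2B_2}}\right)}}$$

in which $\sigma^2_{M_{A_1B_1}}$, etc., would be estimated as the error variance for
the mean of any simple random sample.

Similarly, the simple effect of style for the eight-point size would be tested by

$$t = \frac{M_{A_1B_1} - M_{A_2B_1}}{\sqrt{\text{est'd } \sigma^2_{M_{A_1B_1}} + \text{est'd } \sigma^2_{M_{A_2B_1}}}}$$

and the interaction effect would be tested by

$$t = \frac{(M_{A_1B_1} - M_{A_2B_1}) - (M_{A_1B_2} - M_{A_2B_2})}{\sqrt{\text{est'd } \sigma^2_{M_{A_1B_1}} + \text{est'd } \sigma^2_{M_{A_2B_1}} + \text{est'd } \sigma^2_{M_{A_1B_2}} + \text{est'd } \sigma^2_{M_{A_2B_2}}}}$$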


The problem of the numbers of degrees of freedom for these t's will be 
considered in later chapters. If the samples are large enough to provide 
30 or more degrees of freedom for the t's, the normal probability table may 
be used to evaluate them. 
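The several effects and their t's may be computed together, as in the following sketch. The cell labels, the function name, and the scores are hypothetical, and equal numbers of subjects in the four cells are assumed; the degrees of freedom are deliberately left out, since the text defers that question to later chapters.

```python
import math

def mean(scores):
    return sum(scores) / len(scores)

def est_var_of_mean(scores):
    """Estimated error variance of a sample mean: sum(d^2) / (n * (n - 1))."""
    n = len(scores)
    m = mean(scores)
    return sum((x - m) ** 2 for x in scores) / (n * (n - 1))

def factorial_2x2(cells):
    """Effects and t's for a 2 x 2 design; cells maps 'A1B1', ... to score lists."""
    m = {key: mean(scores) for key, scores in cells.items()}
    total_var = sum(est_var_of_mean(scores) for scores in cells.values())
    main_style = (m["A1B1"] + m["A1B2"]) / 2 - (m["A2B1"] + m["A2B2"]) / 2
    main_size = (m["A1B1"] + m["A2B1"]) / 2 - (m["A1B2"] + m["A2B2"]) / 2
    interaction = (m["A1B1"] - m["A2B1"]) - (m["A1B2"] - m["A2B2"])
    return {
        "main_style": main_style,
        "main_size": main_size,
        "interaction": interaction,
        # Each main effect is a difference of two averages of two cell means,
        # hence the factor 1/2 in front of the square root.
        "t_style": main_style / (0.5 * math.sqrt(total_var)),
        "t_size": main_size / (0.5 * math.sqrt(total_var)),
        "t_interaction": interaction / math.sqrt(total_var),
    }

# Hypothetical reading-rate scores, five subjects per cell.
cells = {
    "A1B1": [20, 21, 22, 23, 24],
    "A2B1": [18, 19, 20, 21, 22],
    "A1B2": [24, 25, 26, 27, 28],
    "A2B2": [26, 27, 28, 29, 30],
}
result = factorial_2x2(cells)
for name, value in result.items():
    print(name, round(value, 3))
```

In these made-up data the main effect of style is zero although the two simple effects of style are +2 and -2, which is the situation in which the interaction, rather than the main effect, carries the information.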

If the interaction effect proved to be non-significant, one could proceed 
with the analysis on the assumption that there is no true interaction between 
size and style; that is, that for the whole population the effect of style is the
same for both sizes, or that the effect of size is the same for both styles.
In that case, the comparison of the main effect of style $(M_{A_1} - M_{A_2})$ would
be just as precise and would serve the same purpose as if the same size of
type had been used with all subjects, and the evaluation of the main effect
of size, the comparison $(M_{B_1} - M_{B_2})$, would be as precise and would
serve the same purpose as if the same style of type had been used with all
subjects. That is, a single experiment with 100 subjects would serve the
same purposes as two single-variable experiments, each employing 100
subjects. Furthermore, the fact would have been established that the
hypothesis of “no interaction” is tenable. This could not have been learned
from either of the single-variable experiments, or from both independently 
considered. 

If, on the other hand, the interaction effect proved to be significant, one 
would have to conclude that the effect of style is different for different sizes
of type or that the effect of size is different for different styles. In that 
case, there would be little interest in the “main” effects; the attention would 
be centered instead on the “simple” effects. This experiment, it will be 
noted, consists essentially of two single-variable experiments with style of 
type, in each of which size is held constant. It may also be regarded as 
consisting of two single-variable experiments with size of type, in which 
style of type is held constant at different levels. The whole experiment 
may thus be regarded as consisting of four single-variable experiments, each 
involving 50 subjects. The precision of any one of these experiments, of 
course, is not as high as that of a single-variable experiment involving 100 
subjects. However, the information secured from four single-variable ex- 
periments of 50 subjects each may well be regarded as more valuable or worth 
while than the information from one single-variable experiment involving 
100 subjects. Furthermore, the experiment employing the factorial design 
would have demonstrated that there is an interaction, which again could 
not have been learned even from two single-variable experiments of 100 
subjects each, if one were concerned only with size and the other only with 
style. 
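The loss of precision in each of the four component experiments can be made concrete under two assumptions that are not stated in the text: that the 100 subjects are divided equally among the four cells, and that the population variance $\sigma^2$ of the criterion is the same in every cell. A simple effect then compares two cells of 25 subjects each, whereas a single-variable experiment with 100 subjects compares two groups of 50 each:

$$\sigma^2_{\text{simple effect}} = \frac{\sigma^2}{25} + \frac{\sigma^2}{25} = \frac{2\sigma^2}{25}, \qquad
\sigma^2_{\text{single-variable}} = \frac{\sigma^2}{50} + \frac{\sigma^2}{50} = \frac{2\sigma^2}{50}$$

Under these assumptions the error variance of a simple effect is twice that of the single-variable comparison, and its standard error is larger by a factor of $\sqrt{2} \approx 1.41$.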




It should be clear, then, that this “factorial” design yields far more in- 
formation than could be obtained from a single-variable experiment with 
100 subjects but concerned with either size or style alone. It yields even 
more information than could be obtained from two such experiments to- 
gether, using twice as many subjects. It should therefore be apparent why 
the factorial design and the method of analysis appropriate to it, which are 
due to R. A. Fisher, have often been described as among the most important 
contributions to experimental technique in recent decades. 

The illustration here used was restricted to two levels of each of two factors, 
in order that simple t-tests might be employed to evaluate the results. In 
the general case, the factorial design involves several factors or treatment 
classifications, and the number of subjects may differ from one level to 
another of the same factor. In this general case, the analysis is considerably 
more complex, and simple t-tests of significance are no longer adequate. 
The general case of the factorial design will be considered in detail in Chapter 9. 

We have noted earlier that the treatments X levels and the factorial 
designs have many features in common, and that it is sometimes difficult to 
decide under which of these types a particular design should be classified. 
If one variable or factor is introduced into the design primarily to make 
possible a more precise estimate of the “main” effect of the other factor, 
and if the interaction effect is of only incidental or secondary interest, the 
design may be clearly classified as a treatments X levels design. In this
case, it is presumably known in advance that the control variable is related
to the criterion variable. Hence, there would be no point in testing the
significance of the main effect of the control factor. On the other hand, 
if the second variable is introduced primarily in order to study and evaluate 
its main effect along with that of the first factor, and/or to study the inter- 
action between the two factors, then the design is clearly of the factorial 
type. In this case, it is presumably not known in advance whether the 
second factor is related to the criterion; hence the purpose of introducing the 
second factor is not to increase the precision of the experiment so far as 
the evaluation of the first factor is concerned. It seems worth while, for the 
purpose of the following discussions, to distinguish between these two designs 
wherever possible. However, if in some instances the purposes of the ex- 
periment are so mixed that one cannot readily decide how to classify the design, 
this is of little practical consequence so long as the results are properly analyzed 
and interpreted. 

The “Groups-Within-Treatments” Design: When the purpose of an experiment
is to establish generalizations about a population consisting of a large
number of subpopulations, it is sometimes not possible to replicate the 
experiment, or to make all possible treatment comparisons, for each of a 
number of randomly selected subpopulations (as in the random replications 
design). For example, in a methods experiment in arithmetic, the methods 
may be such that if both are employed simultaneously in the same school, the 
pupils under one method may exchange information about the methods with
those studying under another. The result may be that teachers and pupils 
who are intended to employ an unadulterated form of one method may, on 
their own initiative, introduce or make use of elements of the other method.
Thus the results obtained for either treatment group may really be a mixture
of the effects of both methods. In such cases, it might be better to administer
only one method in one set of $r_1$ schools and only the other method in another
random set of $r_2$ different schools. Such an experiment might be diagrammed
as follows: 


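Treatment $A_1$

    $M_{A_1S_1}$     $M_{A_1S_2}$     ...     $M_{A_1S_{r_1}}$

$$M_{A_1} = \frac{1}{r_1}\left(M_{A_1S_1} + M_{A_1S_2} + \cdots + M_{A_1S_{r_1}}\right)$$

Treatment $A_2$

    $M_{A_2S_1}$     $M_{A_2S_2}$     ...     $M_{A_2S_{r_2}}$

$$M_{A_2} = \frac{1}{r_2}\left(M_{A_2S_1} + M_{A_2S_2} + \cdots + M_{A_2S_{r_2}}\right)$$

(Note that the school whose mean is $M_{A_1S_1}$ is not the same as the school whose
mean is $M_{A_2S_1}$.)
The appropriate test of the significance of the treatment effect would then be

$$t = \frac{M_{A_1} - M_{A_2}}{\sqrt{\text{est'd } \sigma^2_{M_{A_1}} + \text{est'd } \sigma^2_{M_{A_2}}}}$$

in which

$$\text{est'd } \sigma^2_{M_{A_1}} = \frac{\sum_{i=1}^{r_1} (M_{A_1S_i} - M_{A_1})^2}{r_1(r_1 - 1)}
\quad\text{and}\quad
\text{est'd } \sigma^2_{M_{A_2}} = \frac{\sum_{i=1}^{r_2} (M_{A_2S_i} - M_{A_2})^2}{r_2(r_2 - 1)}$$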


This test of significance involves the assumptions that the number of 
cases (n) is the same in all schools and that the experimental subjects used 
in each school are a random sample of all pupils who might be given the same 
treatment in that school. It also regards each general treatment mean as the 
mean of a random sample of r means rather than as the mean of a sample of 
rn subjects. Since the schools employing either method are a simple random 
sample of all schools in the population, the Type S or Type G or Type R 
errors in the means for those schools may be regarded as a simple random 
sample from the distribution of such errors that would be found for all schools 
if all used the same method. (If any “extraneous” factor tends to operate 
systematically in favor of a certain method, it would presumably operate 
in the same fashion if this method were widely used in practice; hence, the 
“extraneous” factor might properly be regarded as an integral part of the 
method itself.) Thus the test of significance is valid for all types of variable
errors affecting the criterion means in the various schools. This design is 
much less precise than other designs employing the same number of subjects, 
but since it eliminates any possibility of contamination of one treatment by 
another, it is sometimes preferable to other designs in spite of its lack of 
precision. 
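The test for this design may be sketched as follows, with hypothetical school means and a hypothetical function name. The school mean, not the pupil, is the unit of observation. The degrees of freedom are not given in the text for this test; the value returned here, $(r_1 - 1) + (r_2 - 1)$, is an assumption that follows the rule stated earlier of summing the degrees of freedom for the standard errors of the means involved.

```python
import math

def groups_within_treatments_t(school_means_a1, school_means_a2):
    """t for the treatment difference, treating each school mean as one observation."""
    def mean_and_est_var(school_means):
        r = len(school_means)
        m = sum(school_means) / r
        est_var = sum((x - m) ** 2 for x in school_means) / (r * (r - 1))
        return m, est_var

    m1, v1 = mean_and_est_var(school_means_a1)
    m2, v2 = mean_and_est_var(school_means_a2)
    t = (m1 - m2) / math.sqrt(v1 + v2)
    # Assumed degrees of freedom: (r1 - 1) + (r2 - 1).
    df = (len(school_means_a1) - 1) + (len(school_means_a2) - 1)
    return t, df

# Hypothetical criterion means for four schools under each method.
means_a1 = [50, 52, 54, 56]
means_a2 = [48, 49, 50, 51]
t, df = groups_within_treatments_t(means_a1, means_a2)
print(round(t, 3), df)
```

However many pupils are tested in each school, only eight observations enter this test, which shows why the design is less precise than others employing the same number of subjects.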

Combinations of Basic Designs: Some of the specific designs used in the 
preceding illustrations are rarely used in exactly the same form in actual 
research. Each illustration represents a highly restricted special case of a
more generalized design in which there may be more than two categories in 
each treatment classification, and in which the number of subjects may 
vary from category to category within the same treatment classification, or 
from one level to another. Even in their generalized forms, however, these 
basic designs account only for some of the designs used in practice. Most 
designs actually employed in research represent rather complex combinations 
of these basic designs, or of the elements or principles contained in or repre- 
sented by these basic designs. Just one illustration of these complex designs 
will be presented here. This design provides for two experimental variables, 
A and B, with three levels of A and two of B, or with six different treatments
representing the various combinations of A and B. Every subject takes all
levels of A; half the subjects do so in combination with $B_1$ and half in
combination with $B_2$. That is, half the subjects take three of the six treatments,
the other half take the remaining three treatments. One purpose of the
design is to counter-balance the effect of the rank order (O) in which the
various treatments are administered to the same subjects. The design is
diagrammatically represented below:


                    $A_1B_1$   $A_2B_1$   $A_3B_1$   $A_1B_2$   $A_2B_2$   $A_3B_2$

Level 1    $O_1$
           $O_2$
           $O_3$

Level 2

Level 3
etc.




The experiment is performed at seven different levels of a control variable 
L. The number of subjects at each level is a multiple of 6. Within each 
level the subjects are divided at random into six equal groups. Group 1 ($G_1$)
takes treatments $A_1B_1$, $A_2B_1$, and $A_3B_1$ in that order. Group 2 ($G_2$) takes the
same treatments in the following order: $A_2B_1$, $A_3B_1$, and $A_1B_1$; while Group 3
takes them in the order $A_3B_1$, $A_1B_1$, and $A_2B_1$. A similar pattern is followed
with the subjects taking the various A-treatments in combination with $B_2$.
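The three orders just listed are cyclic rotations of one another, so that each treatment occupies each rank position once. The sketch below generates such orders and divides the subjects at one level at random into six equal groups. The function names, the seed, and the subject numbers are hypothetical, and it is an assumption that the "similar pattern" for $B_2$ is the same rotation.

```python
import random

def cyclic_orders(treatments):
    """All cyclic rotations of a treatment list: one administration order per group."""
    k = len(treatments)
    return [treatments[i:] + treatments[:i] for i in range(k)]

def assign_groups(subjects, n_groups, seed=0):
    """Divide the subjects at one level at random into n_groups equal groups."""
    subjects = list(subjects)
    if len(subjects) % n_groups != 0:
        raise ValueError("number of subjects must be a multiple of n_groups")
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    rng.shuffle(subjects)
    size = len(subjects) // n_groups
    return [subjects[g * size:(g + 1) * size] for g in range(n_groups)]

orders_b1 = cyclic_orders(["A1B1", "A2B1", "A3B1"])
orders_b2 = cyclic_orders(["A1B2", "A2B2", "A3B2"])  # assumed "similar pattern"
groups = assign_groups(range(1, 13), 6)  # 12 hypothetical subjects at one level

for group, order in zip(groups, orders_b1 + orders_b2):
    print(group, order)
```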

No attempt will be made here to attach more meaning to this design, or 
to indicate how the results might be analyzed and evaluated. The purpose of 
the example is simply to illustrate, in one instance, the fact that the ideas 
represented in the simple designs thus far considered can be combined to 
form much more complex designs. This particular illustration contains 
elements of the simple treatments X levels design, the simple treatments X 
subjects design, and the simple factorial design. 

There is an almost unlimited number of ways in which more complex 
designs can be devised to fit specific experimental problems and situations. 
If the student is to develop the ability to select or devise for himself the 
designs most appropriate to his own particular problems, facilities, and 
resources, it is essential that he master thoroughly each of the basic designs 
as presented in more highly generalized form in succeeding chapters. 

In concluding this chapter, it may be well to remind the student again 
that he is not expected to have acquired at this point any thorough under- 
standing of the basic designs just presented. They have been presented here 
for introductory purposes only, to provide the student with a brief preview 
of what is to follow, to give him some advance inkling of the many inter- 
esting possibilities in experimental design. Each of these basic designs will be 
presented again in a more highly generalized form in the chapters which 
follow, and will there be accorded much more intensive consideration. Many 
questions which may have occurred to the student in reading this introductory 
chapter had therefore best be left unanswered until these later chapters. 


The Chi-Square, t, and F Distributions 


Introduction 


As has been pointed out in Chapter 1, the purpose of most psychological 
experiments is to test some hypothesis concerning some characteristics 
(parameters) of the populations from which the experimental observations 
may be regarded as random samples. To test any such hypothesis, it is usually 
necessary to know the sampling distributions of estimates of these parameters 
derived from the experimental data. For the experimental designs considered 
in this text, the sampling distributions involved in the tests of significance 
are almost always of the type known as "F-distributions." Before giving
any consideration to the tests of significance employed with these designs, 
therefore, it is necessary to acquaint the student with or to remind him of 
the important characteristics of the F-distribution and of the Chi-Square 
distribution in terms of which F is defined. 


The Chi-Square Distribution 


If a very large number of samples of the same size are independently 
drawn at random from a normal population, if each measure is expressed as a 
deviation from the population mean in units of the standard deviation of the 
population, and if the sum of the squares of these deviations is obtained for 
each sample, these sums will show a characteristic form of distribution which 
is known as the Chi-Square distribution. The definition of Chi-Square
(χ²) is given by

$$\chi^2 = \sum \frac{(X - \mu)^2}{\sigma^2} \tag{1}$$

in which X is any measure in a sample of n cases drawn at random from a
normal population, μ is the population mean, and σ² is the population variance.
This distribution differs for samples of different sizes, ranging from distribu- 
tions sharply skewed to the right for small samples to very nearly bell-shaped 
distributions for large samples. Figure 1 on page 28 indicates very roughly
the form of the χ²-distribution for certain sizes of samples. There is a
different χ²-distribution for each possible value of n. For reasons to be explained
later, the n of (1) is called the number of "degrees of freedom" for the
corresponding χ²-distribution.
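As a rough numerical illustration of definition (1), the following Python sketch draws repeated samples from a normal population and computes χ² for each. The function names, the population values (μ = 50, σ = 10), the seed, and the number of samples are illustrative assumptions and are not taken from the text.

```python
import random

def chi_square(sample, mu, sigma):
    """Sum of squared standardized deviations from the population mean, as in (1)."""
    return sum(((x - mu) / sigma) ** 2 for x in sample)

def simulate_chi_squares(n, trials, mu=50.0, sigma=10.0, seed=1):
    """Draw `trials` independent samples of n cases each and return their chi-squares."""
    rng = random.Random(seed)
    return [chi_square([rng.gauss(mu, sigma) for _ in range(n)], mu, sigma)
            for _ in range(trials)]

values = simulate_chi_squares(n=5, trials=20000)
mean_value = sum(values) / len(values)
print(round(mean_value, 2))  # should settle near n = 5
```

With samples of n = 5, the average of the simulated values should lie close to 5, and a histogram of `values` would be skewed to the right, as described for small samples.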


[FIGURE 1. Approximate forms of χ²-distribution for certain degrees of freedom. Vertical axis: relative frequency; horizontal axis: χ², from 0 to 20.]

The mean of any χ²-distribution is always equal to the number of degrees
of freedom (df), and the mode of the distribution (except when df = 1) is at
df − 2. As the number of degrees of freedom increases, the χ²-distributions
become more and more nearly the same in form; for df > 30, the χ²-distributions
may, for most practical purposes, be regarded as having the same
form. (The mean and variance of the distribution, of course, continue to
increase with further increases in degrees of freedom.) For df > 30, the
distribution of √(2χ²) is essentially normal, and has a mean of √(2df − 1)
and a standard deviation of 1.00.

The probability distribution function¹ for χ² has been determined and
tables have been prepared showing the values of χ² exceeded by various
proportions of the individual χ²'s in each of the χ²-distributions for df = 1
to df = 30.

A table of χ² for df = 1 to df = 30 is given on page 29. Table 1 tells us,
for example, that in the distribution of χ² with one degree of freedom the
median value of χ² is .455, and that a value of 2.706 is exceeded by 10%
of the χ²'s in this distribution. The table shows that for a χ² with 6 degrees
of freedom, the probability is .80 that a randomly selected value of χ² will
exceed 3.07. Again, the chances are 2 in 100 that the χ² will exceed 15.033.

For values of df larger than 30, the normal distribution of √(2χ²) may be
used in lieu of the χ²-distribution.

¹ $y = y_0\, e^{-\chi^2/2}\, (\chi^2)^{(df-2)/2}$

TABLE 1. Table of χ²

[Table: values of χ² exceeded with probability .99, .98, .95, .90, .80, .70, .50, .30, .20, .10, .05, .02, .01, and .001, for df = 1 to df = 30.]

For larger values of n, the expression √(2χ²) − √(2n − 1) may be used as a normal deviate with unit variance, remembering that the probability for χ² corresponds with that of a single tail of the normal curve. (Table 1 is reprinted from Table III of R. A. Fisher, Statistical Methods for Research Workers, published by Oliver and Boyd Ltd., by permission of the author and publishers.)
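The table readings quoted in the text (.455 and 2.706 for one degree of freedom, 3.07 and 15.033 for six) can be checked numerically. The sketch below uses the standard finite-series expressions for the upper tail of χ² with a whole number of degrees of freedom; the function name `chi2_upper_tail` is an assumption made for illustration, and the method is not the one used to prepare Table 1.

```python
import math

def chi2_upper_tail(x, df):
    """P(chi-square with df degrees of freedom exceeds x), for integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: two-tailed normal area beyond sqrt(x), plus a finite correction series.
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2.0))
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return tail + 2.0 * density * total

for x, df in [(0.455, 1), (2.706, 1), (3.070, 6), (15.033, 6)]:
    print(df, x, round(chi2_upper_tail(x, df), 3))
```

The four printed probabilities should agree, to about three decimals, with the .50, .10, .80, and .02 column headings under which those values are tabled.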

From the definition of χ², it follows axiomatically that the sum of two
independent χ²'s is itself distributed as χ² with degrees of freedom equal to the
sum of the degrees of freedom for the χ²'s added. To illustrate, if the value
of χ², computed by (1), is obtained for one random sample of 10 cases and
another independent random sample of 5 cases, the sum of these two values
of χ² must, of course, be the same as the χ², computed by (1), for the
combined sample of 15 cases. This is an extremely important rule in statistical
applications of χ².
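The illustration with samples of 10 and 5 cases can be reproduced directly. In this minimal sketch the population values (μ = 100, σ = 15) and the seed are arbitrary assumptions; only the sample sizes come from the text.

```python
import random

def chi_square(sample, mu, sigma):
    """Chi-square computed by (1): squared deviations from the population mean."""
    return sum(((x - mu) / sigma) ** 2 for x in sample)

rng = random.Random(7)
mu, sigma = 100.0, 15.0
first = [rng.gauss(mu, sigma) for _ in range(10)]    # sample of 10 cases
second = [rng.gauss(mu, sigma) for _ in range(5)]    # independent sample of 5 cases

separate = chi_square(first, mu, sigma) + chi_square(second, mu, sigma)
combined = chi_square(first + second, mu, sigma)     # combined sample of 15 cases
print(abs(separate - combined) < 1e-9)
```

The two quantities agree apart from floating-point rounding, which is why the degrees of freedom (10 and 5) simply add to 15.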

According to this rule, if B and C are independent of one another and
both are distributed as χ² with b and c degrees of freedom, respectively, then
A = B + C is also distributed as χ² with a = b + c degrees of freedom. The
converse of this rule is also true. If A and B are each distributed as χ² with
a and b degrees of freedom, respectively, and if A = B + C and C is
independent of B, then C = A − B is also distributed as χ² with c = a − b degrees
of freedom.

By means of this converse of the rule of sums, it may be shown that if a
random sample of n cases is drawn from a normal population, but each
measure is expressed as a deviation from the sample mean instead of from the
population mean, again in terms of the population σ, the sum of the squared
deviations will again be distributed as χ², but now with n − 1 degrees of
freedom. That is, the distribution of Σ(X − M)²/σ² for random samples of n
cases each is exactly the same as that of Σ(X − μ)²/σ² for samples of n − 1
cases each, M and μ being the sample and population means, respectively.
To prove this, we may first, for any single measure (X), write as an identity,

$$(X - \mu) = (X - M) + (M - \mu).$$

Squaring both sides of this expression, we have

$$(X - \mu)^2 = (X - M)^2 + 2(X - M)(M - \mu) + (M - \mu)^2.$$

Summing these expressions for the n measures in the sample

$$\sum (X - \mu)^2 = \sum (X - M)^2 + 2(M - \mu)\sum (X - M) + n(M - \mu)^2.$$

Dividing each side by the population variance (σ²) and noting that
Σ(X − M) = 0, we have

$$\frac{\sum (X - \mu)^2}{\sigma^2} = \frac{\sum (X - M)^2}{\sigma^2} + \frac{n(M - \mu)^2}{\sigma^2} \tag{2}$$
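Identity (2) is purely algebraic, so it can be verified on any set of numbers. The following sketch uses an arbitrary simulated sample; the values of μ, σ, n, and the seed are assumptions chosen only for illustration.

```python
import random

rng = random.Random(3)
mu, sigma, n = 20.0, 4.0, 12
sample = [rng.gauss(mu, sigma) for _ in range(n)]
m = sum(sample) / n                                         # sample mean M

left = sum((x - mu) ** 2 for x in sample) / sigma ** 2      # left-hand term of (2)
within = sum((x - m) ** 2 for x in sample) / sigma ** 2     # first right-hand term
mean_term = n * (m - mu) ** 2 / sigma ** 2                  # second right-hand term

print(abs(left - (within + mean_term)) < 1e-9)
```

The check holds for any sample, normal or not; normality matters only for the distributions of the three terms, not for the identity itself.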




We now note that the second right-hand term of (2) may be written

$$\frac{n(M - \mu)^2}{\sigma^2} = \frac{(M - \mu)^2}{\sigma^2 / n} = \frac{(M - \mu)^2}{\sigma_M^2}$$

which, according to (1), conforms to the definition of χ², the degrees of
freedom for this χ² being 1. We note also that the left-hand term of (2) is,
according to (1), also distributed as χ², with n degrees of freedom. Granting
that the right-hand terms are independent of one another, it then follows
by the converse of the rule of sums that the first right-hand term must also
be distributed as χ², with degrees of freedom equal to the difference between
the degrees of freedom for the left-hand and the second right-hand terms,
namely (n − 1). Accordingly, granting independence, we may write

$$\chi^2 = \frac{\sum (X - M)^2}{\sigma^2}, \qquad df = n - 1. \tag{3}$$


According to the rule of sums stated earlier, the sum of a number of χ²'s
obtained by means of (3) is also distributed as χ², the number of degrees of
freedom for this χ² being determined by that rule. Suppose again that we
have a sample of 30 cases drawn at random from a normal population, and
that we divide it at random into sub-samples of 5, 10, and 15 cases.
Suppose we then compute a χ² for each sub-sample according to (3). The sum
of these χ²'s will then have 27 df, although it is based on 30 cases. To find
the frequency with which this χ² is exceeded by chance, one must enter
the χ² table with df = 27, even though the number of cases in the sample is 30.


Proof of the Independence of the Mean and Variance of Random 
Samples Drawn from a Normal Population 


We have not yet proved that the right-hand terms of (2) are independent 
of one another. Before doing so, it will be well to define more exactly the 
meaning of “independence.” Two variates may be said to be independent in 
the probability sense if the distribution of the first is exactly the same for 
all values of the second, and vice versa. Given the paired variates X and Y, 
X may be said to be independent of Y if the distribution of X is the same for 
all values of Y, or if that of Y is the same for all values of X. From this
definition, it is axiomatic that if X is independent of Y, then aX is also
independent of Y, a being a constant; and X − a and X² are each also independent
of Y.

Proof that the right-hand terms of (2) are independent rests on the more
fundamental proposition that the mean and variance of random samples
drawn from a normal population are independent. Obviously, if the mean
(M) is independent of the variance [Σ(X − M)²/n], then, by the axioms
stated above, n(M − μ)² in (2) must be independent of Σ(X − M)².

The proof of the independence of the mean and variance is somewhat
involved, and some students may prefer to take this proof on faith. In that
case, they may skip the rest of this section and go on to the section on "Degrees
of Freedom." However, the proof provided here involves no mathematics
beyond that taught in high school algebra, and with a little perseverance
most students should be able to understand it completely.

To prove that the mean and variance of random samples drawn from a
normal population are independent, we must first demonstrate the
independence of the sum and difference of two measures independently drawn at random
from the same normal population. For simplicity, we will express each
measure as a deviation (z = (X − μ)/σ) from the population mean in terms of
the population standard deviation. Suppose, then, that we draw a very
large number of pairs of measures at random from the same normal population,
letting z′ represent the first measure drawn and z the second drawn in each
pair. Suppose we then plot each pair of measures on a scattergram (see
Figure 2), and for each (extremely small) unit of area on the scattergram we
erect a perpendicular proportional to the number of pairs of measures plotted
in that unit of area. If we then let the number of pairs drawn approach
infinity, and also let the unit of area become infinitesimally small, the upper
ends of these perpendiculars will describe a smooth bell-shaped surface,
which we will call the correlation surface.

[FIGURE 2. Scattergram for Pairs of Measures Drawn at Random from a Normal Population. Axes: first measure drawn in each pair; second measure drawn in each pair.]

If z and z′ are independent, the distribution of z for any given value of z′,
or of z′ for any given value of z, is the same normal distribution as that which
characterizes the entire population. Accordingly, the distribution of z
along the line z′ = 0 (Line C in the figure) will be described by the normal
curve $y = y_0 e^{-z^2/2}$, y₀ representing the ordinate at the mean of this curve.
Accordingly, the ordinate of this curve at the point (z₁, 0) is $y_0 e^{-z_1^2/2}$. The
distribution of z′ along the line z = z₁ (Line D in the figure) is likewise
a normal distribution whose mean ordinate we have just shown to be
$y_0 e^{-z_1^2/2}$. Accordingly the ordinate of this latter curve at the point (z₁, z′₁) is

$$y_0\, e^{-z_1^2/2}\, e^{-z_1'^2/2} = y_0\, e^{-(z_1^2 + z_1'^2)/2}.$$

From this it is evident that the points on the scattergram
for which the frequency (ordinate) is a constant are the points for
which z² + z′² is a constant, that is, for which z² + z′² = a². From the
fact that the square on the hypotenuse of a right-angle triangle is equal to
the sum of the squares on the legs of the triangle, it is apparent that these
points of equal frequency would lie along the circle whose radius is a and
whose center is at 0.

We have now shown that the lines of equal frequency on the scattergram 
are concentric circles with centers at 0, as is indicated in the figure. This 
means that the correlation surface is completely symmetrical about 0, or 
that the line of intersection of the correlation surface with any plane through 
0 perpendicular to the scattergram will always describe the same normal 
curve. It means also that the line of intersection with the correlation surface 
of any plane perpendicular to the scattergram is a normal curve with the 
same variance as that which characterizes the entire population. 

We may next note that all points for which the sum of z and z′ is a
constant lie on a straight line (z + z′ = c) with a slope of −1.00. The sloping
line passing through the point A in the diagram, and all lines parallel to it,
are such lines. It is obvious that the closer one of these lines comes to the
lower left corner of the figure, the lower is the value of the sum of the meas- 
ures. The closer the line comes to the upper right corner, the higher is the 
value of the sum. 

It is likewise apparent that all pairs of measures for which the difference
is a constant will fall along a line (z − z′ = k), with a slope equal to +1.00,
such as line B and all lines parallel to it in the figure. For every pair of
measures represented by a point on line B (which passes through the means
of the z and z′ distributions), the value of z is equal to the value of z′; hence
their difference is equal to zero. All points along any line parallel to B
represent pairs for which the difference is constant. The farther this line from B,
the larger is the value of the difference. It is now apparent that for all lines
of constant sums, the distribution of (z − z′) is a normal distribution with
exactly the same mean and variance. In other words, (z + z′) is independent
of (z − z′), or the sum of two measures selected at random from a normal
distribution is independent of their difference.

We may next note that for any sample of size n, the sum of the squared
deviations from the sample mean, Σ(X − M)², is equal to 1/n times the
sum of the squares of the differences for all possible pairs of measures. That
is,

$$\sum (X - M)^2 = \frac{1}{n} \sum_{i=1}^{n-1} \sum_{\substack{j=2 \\ j>i}}^{n} (X_i - X_j)^2,$$

when i takes all values up to n − 1 and j all values from 2 to n, each pair
being counted once (j > i). A general proof of this would be quite cumbersome
so it will be demonstrated below only for the case in which n = 3. If he
wishes, the student can readily provide similar proofs for n = 4 and n = 5,
at which point he will be ready to recognize that the relationship holds for
any value of n.


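For n = 3,

$$
\begin{aligned}
\sum (X - M)^2 &= \left(X_1 - \frac{X_1 + X_2 + X_3}{3}\right)^2 + \left(X_2 - \frac{X_1 + X_2 + X_3}{3}\right)^2 + \left(X_3 - \frac{X_1 + X_2 + X_3}{3}\right)^2 \\
&= \tfrac{1}{9}\left[(2X_1 - X_2 - X_3)^2 + (2X_2 - X_1 - X_3)^2 + (2X_3 - X_1 - X_2)^2\right] \\
&= \tfrac{1}{9}\big(4X_1^2 + X_2^2 + X_3^2 - 4X_1X_2 - 4X_1X_3 + 2X_2X_3 \\
&\qquad + 4X_2^2 + X_1^2 + X_3^2 - 4X_1X_2 - 4X_2X_3 + 2X_1X_3 \\
&\qquad + 4X_3^2 + X_1^2 + X_2^2 - 4X_1X_3 - 4X_2X_3 + 2X_1X_2\big) \\
&= \tfrac{1}{3}\left(2X_1^2 + 2X_2^2 + 2X_3^2 - 2X_1X_2 - 2X_1X_3 - 2X_2X_3\right) \\
&= \tfrac{1}{3}\left[(X_1 - X_2)^2 + (X_1 - X_3)^2 + (X_2 - X_3)^2\right] \\
&= \frac{1}{n} \sum_{i=1}^{n-1} \sum_{\substack{j=2 \\ j>i}}^{n} (X_i - X_j)^2 .
\end{aligned}
$$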
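The relationship between squared deviations from the mean and squared pairwise differences can also be checked numerically for a sample larger than three. The five data values and the function names below are made up for illustration.

```python
from itertools import combinations

def sum_sq_from_mean(xs):
    """Sum of squared deviations from the sample mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def sum_sq_pairwise(xs):
    """1/n times the sum of squared differences over all distinct pairs."""
    return sum((a - b) ** 2 for a, b in combinations(xs, 2)) / len(xs)

data = [3.0, 7.0, 8.0, 12.0, 15.0]
print(sum_sq_from_mean(data), sum_sq_pairwise(data))  # prints 86.0 86.0
```

`combinations(xs, 2)` visits each pair exactly once, which corresponds to restricting the double sum to j > i.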


Now it is axiomatic that if a certain variate is independent of each of a
number of other variates, then that variate must also be independent of
their sum. We know that X₃, X₄, ... Xₙ are all independent of (X₁ − X₂).
Hence, according to the axiom, (X₃ + X₄ + ... + Xₙ) is independent of
(X₁ − X₂). We have already shown that (X₁ + X₂) is independent of
(X₁ − X₂). Therefore, according to the axiom, T = (X₁ + X₂ + X₃ ... + Xₙ)
is independent of (X₁ − X₂). In the same way, it may be shown that T is
independent of the difference between any other two measures. Hence,
T must be also independent of the square (Xᵢ − Xⱼ)² of any of the differences,
and finally, of the sum

$$\sum_{i=1}^{n-1} \sum_{\substack{j=2 \\ j>i}}^{n} (X_i - X_j)^2$$

of the squared differences for all
possible pairs of measures. From this it follows that the mean of the sample
is independent of the variance.
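A simulation cannot prove independence, but it can show one of its consequences: across many random samples from a normal population, the sample mean and the sum of squared deviations should be essentially uncorrelated. The sample size, number of samples, seed, and helper name `correlation` in this sketch are assumptions chosen for illustration.

```python
import random

def correlation(xs, ys):
    """Product-moment correlation between two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(11)
means, sums_of_squares = [], []
for _ in range(20000):
    sample = [rng.gauss(0.0, 1.0) for _ in range(5)]
    m = sum(sample) / 5
    means.append(m)
    sums_of_squares.append(sum((x - m) ** 2 for x in sample))

r = correlation(means, sums_of_squares)
print(round(r, 3))  # expected to be close to zero
```

A correlation near zero is necessary for independence but not sufficient, so this is a plausibility check rather than a substitute for the proof.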


Degrees of Freedom 


We have seen that the number with which we enter the table represents
the number of cases in a sample only when χ² is computed by (1). In
practice, nearly all χ²'s are computed in other ways, so that the number with
which the χ² table is entered is only rarely the number of cases in a
sample.

We see now why it is convenient to call the number with which we enter
the χ² table by some name that suggests its general nature, and we can begin
to understand the reason for the name "degrees of freedom." When each
measure in a random sample is expressed as a deviation from the population
mean, any measure selected is free to take any value. Thus, in a sample of
n cases, there are n "degrees of freedom," one corresponding to each
observation. When each measure is expressed as a deviation from the sample
mean, the value of the "last" measure drawn is already determined by the
values of the other measures, or by the restriction that the sum of the
deviations must be zero. For example, in a sample of four cases, we can specify
any value we wish for three of the deviations, such as −4.0, 2.3, and −1.7,
but if the sum of the deviations is to equal zero, the fourth deviation must
be 3.4. We say, then, that the value of χ² for this sample has three degrees
of freedom (df = 3).

We may state as a general rule that the number of degrees of freedom for
a χ² computed from a number of observations is equal to the number of
observations minus the number of algebraically independent linear restrictions
placed upon them. That is, if we have computed a χ² from N observations
on which r linear restrictions have been placed, the degrees of freedom for the
χ² is N − r. In general, a linear restriction takes the form of a first degree
or linear equation specifying a relationship existing among the observations.
In all the applications of χ² represented in this text, the linear restrictions
represent only simple restrictions upon sums of the measures involved. In
(3), for example, we have specified that the sum of the deviations must be
equal to zero, or that the sum of the measures is nM. Accordingly, the
degrees of freedom for a χ² computed by (3) is one less than for a χ²
computed from the same sample by (1).

Any one of a number of linear restrictions is algebraically independent of
the others if it is not already determined by or is not related to the others.
Suppose, for instance, that in the case of a sample of 4 measures, A, B, C and
D, we specify that A + B = 3, C + D = 7, A + C = 4, and B + D = 6.
We have thus imposed four linear restrictions, but only three of them are
algebraically independent, since, if any three of these relations hold, the
remaining one is bound to hold also. That is, any one of the four restrictions
may be derived from the other three, or may be regarded as already determined
by them.
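One way to count algebraically independent linear restrictions is to write each restriction as a row of coefficients and find the rank of the resulting matrix. The sketch below does this for the four restrictions on A, B, C, and D with a small exact elimination routine; the `rank` helper is written here for illustration and is not a procedure given in the text.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a small matrix by Gaussian elimination with exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank_count, col, n_cols = 0, 0, len(m[0])
    while rank_count < len(m) and col < n_cols:
        # Find a row at or below the current one with a non-zero entry in this column.
        pivot = next((r for r in range(rank_count, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            col += 1
            continue
        m[rank_count], m[pivot] = m[pivot], m[rank_count]
        for r in range(rank_count + 1, len(m)):
            factor = m[r][col] / m[rank_count][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank_count])]
        rank_count += 1
        col += 1
    return rank_count

restrictions = [
    [1, 1, 0, 0],  # A + B = 3
    [0, 0, 1, 1],  # C + D = 7
    [1, 0, 1, 0],  # A + C = 4
    [0, 1, 0, 1],  # B + D = 6
]
independent = rank(restrictions)
print(independent)        # prints 3
print(4 - independent)    # prints 1: one measure remains free to vary
```

The rank is 3 rather than 4 because the first two rows add to the same total as the last two, so any one restriction follows from the other three.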

Suppose that we have a random sample of N = nr cases drawn from a
normal population and that this sample is divided at random into r
sub-samples of n cases each. The number of degrees of freedom for the χ²
computed by (3) for any sub-sample is n − 1. Accordingly, the number of
degrees of freedom for the χ² represented by the sum of these χ²'s for all
sub-samples is r(n − 1) = N − r. For example, if there are four sub-samples
in a sample of 40 cases, N − r = 40 − 4 = 36. We may regard the four
sub-sample means as constituting a random sample of four cases. If each
of these means is expressed as a deviation from the sample mean, we may
compute from them by (3) a χ² with three degrees of freedom. If the sample
mean is expressed as a deviation from the population mean, we can
compute from it by (1) a χ² with one degree of freedom. Thus the χ² with 40
degrees of freedom which could be computed by (1) from the total sample
has been partitioned into one χ² with 36 degrees of freedom, another χ² with
3 degrees of freedom, and another χ² with one degree of freedom, all 40 degrees
of freedom thus being accounted for.
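The partition just described (36 + 3 + 1 = 40 degrees of freedom) has an exact arithmetic counterpart in the sums of squares. The sketch below builds four sub-samples of ten cases and confirms that the three component χ²'s add to the total; the population values (μ = 0, σ = 1), the seed, and the variable names are illustrative assumptions.

```python
import random

rng = random.Random(5)
mu, sigma, r, n = 0.0, 1.0, 4, 10     # four sub-samples of 10 cases, so N = 40
groups = [[rng.gauss(mu, sigma) for _ in range(n)] for _ in range(r)]
everything = [x for g in groups for x in g]
N = len(everything)
grand_mean = sum(everything) / N
group_means = [sum(g) / n for g in groups]

# Total chi-square by (1): 40 df.
total = sum((x - mu) ** 2 for x in everything) / sigma ** 2
# Sum of the sub-sample chi-squares by (3): 4 * (10 - 1) = 36 df.
within = sum((x - gm) ** 2 for g, gm in zip(groups, group_means) for x in g) / sigma ** 2
# Chi-square from the four sub-sample means (each with variance sigma^2 / n): 3 df.
between = n * sum((gm - grand_mean) ** 2 for gm in group_means) / sigma ** 2
# Chi-square from the mean of the whole sample: 1 df.
mean_part = N * (grand_mean - mu) ** 2 / sigma ** 2

print(abs(total - (within + between + mean_part)) < 1e-9)
print((N - r) + (r - 1) + 1 == N)
```

The same three-way split of a total sum of squares is the arithmetic on which the analysis of variance rests.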

Just as a number of independent χ²'s can be combined into a single χ²,
so can the χ² for a sample be divided, by a process similar to that just
illustrated, into a number of constituent χ²'s. In the analysis of all the designs
considered in this text, the total sum of squared deviations from the general
mean [Σ(X − M)²] is analyzed into components corresponding to different
sets of subgroups into which the total sample is divided, or to the different
sources of variation; the total df is also correspondingly partitioned. We
will see that with any sample we can, if we wish, identify a component of the
total sum of squared deviations from the general mean with each individual
degree of freedom for that total. Under appropriate hypotheses, a χ² can be
computed on the basis of each component of the total sum of squared
deviations. This kind of analysis, which has been termed "analysis of variance,"
is basic in all the designs considered in this text. It is therefore extremely
important that the student master thoroughly the χ²-distribution and the
concept of degrees of freedom.

It should be noted that the general concept of degrees of freedom is broader
than the meanings attached to it here. The rule for determining the degrees
of freedom for a sample in terms of restrictions upon the means of subgroups,
for instance, is adequate for the purposes of this text, but would not serve in
determining the degrees of freedom for χ² in a test of goodness of fit of a
hypothetical frequency distribution to an observed one. However, for the
sake of simplicity of presentation, it seems desirable here to attach only as
many meanings to this term as the purposes of this text demand.


The t-Distribution 


The t-ratio may be defined as the ratio between a randomly selected normal
deviate expressed in units of the population σ, and the square root of a
randomly selected χ² divided by its degrees of freedom. Suppose that X is
normally distributed for a population whose mean is μ and whose variance is
σ², and that z = (X − μ)/σ. If we select a z at random from this population and
independently select a χ² at random from the χ²-distribution for k degrees
of freedom, we may form a t-ratio as follows:

$$t = \frac{z}{\sqrt{\chi^2 / k}}$$

This ratio has a characteristic distribution for each of the possible values of
k. Its distribution is much like the normal distribution but is more peaked
than the normal distribution for small values of k. For k = ∞, t is normally
distributed. For k > 30, the t-distribution may, for most practical purposes,
be regarded as normally distributed with zero mean and unit variance.

Suppose now that we draw a random sample of n cases from a normal
distribution of X's, compute the sample mean (M), and estimate its standard
error (σ_M). The ratio

$$\frac{M - \mu}{\text{est'd } \sigma_M}$$

is then distributed as t with (n − 1) degrees
of freedom. This is evident if we divide both numerator and denominator
of the ratio by σ_M = σ/√n, having substituted for est'd σ_M its value

$$\sqrt{\frac{\sum (X - M)^2}{n(n - 1)}}.$$

The independence of numerator and denominator follows
from the independence of the mean and variance of a random sample from
a normal population. The ratio

$$\frac{M_1 - M_2}{\text{est'd } \sigma_{(M_1 - M_2)}},$$

in which M₁ and M₂ are the
means of independent random samples of n₁ and n₂ cases, respectively,
drawn from the same normal population, and

$$\text{est'd } \sigma_{(M_1 - M_2)} = \sqrt{\frac{\sum (X - M_1)^2 + \sum (X - M_2)^2}{n_1 + n_2 - 2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)},$$

may similarly be
shown to be distributed as t with (n₁ + n₂ − 2) degrees of freedom. It is
assumed that the student using this text is familiar with the applications of
these t-ratios in testing hypotheses concerning population means and
differences in population means.
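The two ratios just described can be computed in a few lines. This is a minimal sketch: the function names and the small data sets are invented for illustration, and the two-sample version pools the two sums of squares over n₁ + n₂ − 2 degrees of freedom.

```python
import math

def one_sample_t(xs, mu):
    """(M - mu) / est'd sigma_M, returned with its n - 1 degrees of freedom."""
    n = len(xs)
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)
    return (m - mu) / math.sqrt(ss / (n * (n - 1))), n - 1

def two_sample_t(xs, ys):
    """(M1 - M2) / est'd sigma_(M1 - M2), pooling the two sums of squares."""
    n1, n2 = len(xs), len(ys)
    m1, m2 = sum(xs) / n1, sum(ys) / n2
    ss = sum((x - m1) ** 2 for x in xs) + sum((y - m2) ** 2 for y in ys)
    se = math.sqrt(ss / (n1 + n2 - 2) * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, n1 + n2 - 2

t1, df1 = one_sample_t([4.0, 6.0, 8.0, 10.0, 12.0], mu=5.0)
t2, df2 = two_sample_t([4.0, 6.0, 8.0], [1.0, 2.0, 3.0])
print(round(t1, 3), df1)
print(round(t2, 3), df2)
```

Each function returns the ratio together with the degrees of freedom with which the t-table would be entered.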

Table 2 on page 38 presents the relative frequencies with which various
absolute values of t are exceeded for df = 1 to df = 30. For example, in the
t-distribution for 2 degrees of freedom, |t| > .289 80% of the time, |t| > 9.925
1% of the time, etc.
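The two readings for 2 degrees of freedom can be checked in two ways: by building t-ratios from the definition (a normal deviate over the square root of χ²/k) and by a closed-form expression that happens to exist for k = 2. The function names, the seed, and the number of draws are assumptions made for this sketch.

```python
import math
import random

def t_ratio(rng, k):
    """One t-ratio built from its definition: z / sqrt(chi-square / k)."""
    z = rng.gauss(0.0, 1.0)
    chi_sq = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))
    return z / math.sqrt(chi_sq / k)

def two_tail_prob_df2(c):
    """P(|t| > c) for 2 degrees of freedom (closed form for this case only)."""
    return 1.0 - c / math.sqrt(2.0 + c * c)

rng = random.Random(2)
ts = [t_ratio(rng, 2) for _ in range(20000)]
observed = sum(abs(t) > 0.289 for t in ts) / len(ts)

print(round(observed, 3))                    # simulated proportion, near .80
print(round(two_tail_prob_df2(0.289), 3))    # exact proportion for |t| > .289
print(round(two_tail_prob_df2(9.925), 3))    # exact proportion for |t| > 9.925
```

The simulated proportion fluctuates with the seed, while the closed-form values should reproduce the 80% and 1% figures quoted from the table.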


TABLE 2. Table of t

[Table: absolute values of t exceeded with probability .9, .8, .7, .6, .5, .4, .3, .2, .1, .05, .02, and .01, for df = 1 to df = 30 and df = ∞.]

(Table 2 is reprinted from Table IV of R. A. Fisher, Statistical Methods for Research Workers, published by Oliver and Boyd Ltd., by permission of the author and publishers.)


The F-Distribution


Suppose now that we have two independent χ²-distributions, one for df₁
degrees of freedom, and the other for df₂ degrees of freedom. Suppose that
we select a value of χ² at random from each distribution, divide each χ² by
its own number of degrees of freedom, and then form a ratio between the
two quotients, as follows:

$$\frac{\chi_1^2 / df_1}{\chi_2^2 / df_2}$$


Suppose we do this repeatedly until we have a very large number of such
ratios. We will then find that this ratio takes a characteristic form of
distribution. The form of this distribution differs for each possible
combination of values for df₁ and df₂. The exact form of each of these distributions
is known, and tables have been constructed showing what value of this
ratio is exceeded certain proportions of the time for each of a large number
of combinations of degrees of freedom. The determination of the form
of this distribution is due to R. A. Fisher, and the ratio has been named the
F-ratio in honor of Fisher by G. W. Snedecor. The F-ratio may be defined
symbolically as

$$F = \frac{\chi_1^2 / df_1}{\chi_2^2 / df_2} \tag{4}$$

The F-distribution has a median of 1.00 when df₁ = df₂, and a median below 1.00 when df₁ is less than df₂, but the values of F
may range from zero to extremely large values, depending upon the number 
of degrees of freedom. It is important to note that the numerator and de- 
nominator of (4) must be independent of one another if the ratio is to be 
distributed as F. For example, if the χ² in the numerator were always a sum
of the χ² in the denominator plus some other χ², that is, if the χ² in the
numerator always contained the χ² in the denominator, the numerator and
denominator would show a positive correlation for a series of such ratios.
In that case the values of F would obviously cluster more closely around
1.00 than if the numerator and denominator were uncorrelated. In order,
then, to show that a ratio like that in (4) is distributed as F, we must always
establish the independence of the two χ²'s.

Table 3, pages 41-44, gives the 20%, 10%, 5%, 2.5%, 1%, 0.5% and 0.1% 
“points” in the F-distribution for each of various combinations of degrees 
of freedom. The “20% point” in an F-distribution represents the value of 
F exceeded 20% of the time in this distribution, or the point on the F-scale 
to the right of which 20% of the distribution lies. The other percent points 
are similarly defined. In Table 3 the 5% value of F for 2 and 6 degrees
of freedom, for instance, is 5.14. This means that in the F-distribution for 
these degrees of freedom, 5% of the distribution lies to the right of 5.14. 
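The 5% point quoted for 2 and 6 degrees of freedom can be checked without a table, because the upper tail of F has a simple closed form when the numerator has exactly 2 degrees of freedom. That closed form is a standard result not derived in the text, and the function name is an assumption made for this sketch.

```python
def f_upper_tail_2_and_k(f, k):
    """P(F > f) when the numerator has 2 and the denominator k degrees of freedom."""
    return (1.0 + 2.0 * f / k) ** (-k / 2.0)

tail = f_upper_tail_2_and_k(5.14, 6)
print(round(tail, 3))  # proportion of the distribution to the right of 5.14
```

The result should be very close to .05, in agreement with the reading of 5.14 as the 5% point.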

Frequently the degrees of freedom of the F to be evaluated in an experi- 
ment do not correspond to any of the combinations of degrees of freedom 
for which F is tabled in Table 3. In that case, one may interpolate between 
the tabled values, but the common procedure in practice is to use the F for 
the nearest combination of smaller degrees of freedom that can be found
in the table. For example, if the obtained F had 10 and 3 degrees of freedom
one would use the value for 8 and 3 degrees of freedom, or if the obtained F
had 17 and 48 degrees of freedom, one would use the F for 12 and 40 degrees 
of freedom. 

If it is desired to know the value of F below or to the left of which a certain 
percent of the cases fall, this may be determined by simply substituting the 
reciprocal of F for the F read from Table 3 when the degrees of freedom
are interchanged. For instance, the point below which 5% of the F's fall
in the distribution for 2 and 6 degrees of freedom is 1/19.33 = .0517, the
denominator being the F for 6 and 2 degrees of freedom. 
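The reciprocal rule follows from the definition in (4). A short sketch of the reasoning, in which the subscripted symbol $F_{df_1, df_2}$ is introduced here only to keep track of the degrees of freedom:

```latex
% If F has df_1 and df_2 degrees of freedom, 1/F is the same kind of ratio
% with numerator and denominator interchanged.
\[
P\left(F_{df_1,\,df_2} < c\right)
  = P\left(\frac{1}{F_{df_1,\,df_2}} > \frac{1}{c}\right)
  = P\left(F_{df_2,\,df_1} > \frac{1}{c}\right)
\]
% Worked case from the text: lower 5% point for 2 and 6 degrees of freedom.
\[
P\left(F_{6,\,2} > 19.33\right) = .05
\quad\Longrightarrow\quad
P\left(F_{2,\,6} < \tfrac{1}{19.33}\right) = P\left(F_{2,\,6} < .0517\right) = .05
\]
```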

It is easy to show that F for 1 and k degrees of freedom is distributed
as $t^2$ for k degrees of freedom. Accordingly, the first column of the F-table
constitutes a condensed t-table, giving the value of $t^2$ for the selected levels
of significance for 40, 60, and 120 degrees of freedom, as well as for all
numbers of degrees of freedom of 30 and below.
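The relation can be sketched in two lines, assuming the usual definition of t as a unit normal deviate z divided by the square root of an independent $\chi^2$ over its k degrees of freedom. The subscripted symbols are introduced only for this illustration, and 2.262 is the familiar two-tailed 5% point of t for 9 degrees of freedom.

```latex
\[
t_k = \frac{z}{\sqrt{\chi_k^2 / k}}
\quad\Longrightarrow\quad
t_k^2 = \frac{z^2 / 1}{\chi_k^2 / k} = \frac{\chi_1^2 / 1}{\chi_k^2 / k} = F_{1,\,k}
\]
% Numerical check for k = 9.
\[
(2.262)^2 = 5.117 \approx 5.12
\]
```

The value 5.12 agrees with the 5% point of F for 1 and 9 degrees of freedom.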

It should be apparent from the definition of F that the ratio between the 
estimates [$\Sigma(X - M)^2/(n - 1)$] of the population variance derived from
two random samples drawn from the same normal population is distributed 
as F. Accordingly, given the variance estimates obtained from random 
samples drawn from different populations, we may, on the assumption that 
the populations are normal, test the hypothesis that the populations have 
the same variance. It should be noted that in this case, if the F-ratio is 
always formed by putting the larger estimate in the numerator, the F which 
is significant at the 2% level is that which exceeds the 1% point in the table, 
that which exceeds the 5% value in the table is significant at the 10% level, 
etc. 
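The convention just described can be sketched in code. This is a minimal illustration, assuming two small sets of hypothetical scores invented for the purpose; the function names are likewise illustrative. The larger estimate is always placed in the numerator, so the tabled percent must be doubled to obtain the level of significance.

```python
from statistics import variance


def larger_over_smaller(sample_1, sample_2):
    """Form the F-ratio with the larger variance estimate in the numerator.

    Returns (F, df for numerator, df for denominator).
    """
    est_1 = variance(sample_1)  # sum of squared deviations / (n - 1)
    est_2 = variance(sample_2)
    if est_1 >= est_2:
        return est_1 / est_2, len(sample_1) - 1, len(sample_2) - 1
    return est_2 / est_1, len(sample_2) - 1, len(sample_1) - 1


def level_of_significance(tabled_percent):
    """Level reached when the larger estimate is always the numerator:
    both tails of the F-distribution are in play, so the percent doubles."""
    return 2 * tabled_percent


# Hypothetical scores, invented for illustration only.
group_a = [12, 15, 11, 18, 14, 16, 13]
group_b = [10, 22, 7, 25, 13, 19, 16]

f_ratio, df_num, df_den = larger_over_smaller(group_a, group_b)
print(f"F = {f_ratio:.2f} with {df_num} and {df_den} degrees of freedom")
print("exceeding the tabled 5% point means significance at the",
      f"{level_of_significance(5)}% level")
```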

The F-ratio, as we shall see, provides the basis for nearly all the tests of 
significance in the designs we shall consider later in this text; the logic just 
presented should therefore be thoroughly mastered by the student. 




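TABLE 3
Percent Points in the Distribution of F

$df_2$ | Percent | $df_1$ = 1 | 2 | 3 | 4 | 5 | 6 | 8 | 12 | 24 | $\infty$
1 | 0.1% | 405284 | 500000 | 540379 | 562500 | 576405 | 585937 | 598144 | 610667 | 623497 | 636619
  | 0.5% | 16211 | 20000 | 21615 | 22500 | 23056 | 23437 | 23925 | 24426 | 24940 | 25465
  | 1% | 4052 | 4999 | 5403 |  | 5764 |  | 5981 | 6106 | 6234 | 6366
  | 2.5% | 647.79 | 799.50 | 864.16 | 899.58 | 921.85 |  | 956.66 | 976.71 | 997.25 | 1018.30
  | 5% | 161.45 | 199.50 | 215.71 | 224.58 | 230.16 |  | 238.88 | 243.91 | 249.05 | 254.32
  | 10% | 39.86 | 49.50 | 53.59 | 55.83 | 57.24 |  | 59.44 | 60.70 | 62.00 | 63.33
  | 20% | 9.47 | 12.00 | 13.06 | 13.73 | 14.01 |  | 14.59 | 14.90 | 15.24 | 15.58
2 | 0.1% | 998.5 | 999.0 | 999.2 | 999.2 | 999.3 |  |  |  | 999.5 |
  | 0.5% | 198.50 | 199.00 | 199.17 | 199.25 | 199.30 |  |  |  |  |
  | 1% |  | 99.00 | 99.17 |  | 99.30 |  |  |  |  |
  | 2.5% | 38.51 | 39.00 | 39.17 | 39.25 | 39.30 |  |  |  |  |
  | 5% | 18.51 | 19.00 | 19.16 | 19.25 | 19.30 | 19.33 |  |  |  |
  | 10% | 8.53 | 9.00 | 9.16 | 9.24 | 9.29 |  |  |  |  |
  | 20% | 3.56 | 4.00 | 4.16 | 4.24 | 4.28 |  |  |  |  |
3 | 0.1% | 167.5 | 148.5 | 141.1 | 137.1 | 134.6 |  |  |  |  |
  | 0.5% | 55.55 | 49.80 | 47.47 | 46.20 | 45.39 |  |  |  |  |
  | 1% | 34.12 | 30.81 | 29.46 | 28.71 |  |  |  |  |  |
  | 2.5% | 17.44 | 16.04 | 15.44 | 15.10 | 14.89 |  |  |  |  |
  | 5% | 10.13 | 9.55 | 9.28 | 9.12 | 9.01 |  |  |  |  |
  | 10% | 5.54 | 5.46 | 5.39 | 5.34 | 5.31 |  |  |  |  |
  | 20% | 2.68 | 2.89 | 2.94 | 2.96 | 2.97 |  |  |  |  |
4 | 0.1% | 74.14 | 61.25 | 56.18 | 53.44 | 51.71 |  |  |  |  |
  | 0.5% | 31.33 | 26.28 | 24.26 | 23.16 | 22.46 |  |  |  |  |
  | 1% | 21.20 | 18.00 | 16.69 | 15.98 | 15.52 |  |  |  |  |
  | 2.5% | 12.22 | 10.65 | 9.98 |  | 9.36 |  |  |  |  |
  | 5% | 7.71 | 6.94 | 6.59 | 6.39 | 6.26 |  |  |  |  |
  | 10% | 4.54 | 4.32 | 4.19 | 4.11 | 4.05 |  |  |  |  |
  | 20% | 2.35 | 2.47 | 2.48 | 2.48 | 2.48 |  |  |  |  |
5 | 0.1% | 47.04 | 36.61 | 33.20 | 31.09 | 29.75 |  |  |  |  |
  | 0.5% | 22.79 | 18.31 | 16.53 | 15.56 | 14.94 |  |  |  |  |
  | 1% | 16.26 | 13.27 | 12.06 | 11.39 | 10.97 |  |  |  |  |
  | 2.5% | 10.01 | 8.43 | 7.76 | 7.39 | 7.15 |  |  |  |  |
  | 5% | 6.61 | 5.79 | 5.41 | 5.19 | 5.05 |  |  |  |  |
  | 10% | 4.06 | 3.78 | 3.62 | 3.52 | 3.45 |  |  |  |  |
  | 20% | 2.18 | 2.26 | 2.25 | 2.24 | 2.23 |  |  |  |  |
6 | 0.1% | 35.51 | 27.00 | 23.70 | 21.90 | 20.81 |  |  |  |  |
  | 0.5% | 18.64 | 14.54 | 12.92 | 12.03 | 11.46 |  |  |  |  |
  | 1% | 13.74 | 10.92 | 9.78 | 9.15 | 8.75 |  |  |  |  |
  | 2.5% | 8.81 | 7.26 | 6.60 | 6.23 | 5.99 |  |  |  |  |
  | 5% | 5.99 | 5.14 | 4.76 | 4.53 | 4.39 |  |  |  |  |
  | 10% | 3.78 | 3.46 | 3.29 | 3.18 | 3.11 |  |  |  |  |
  | 20% | 2.07 | 2.13 | 2.11 | 2.09 | 2.08 |  |  |  | 1.99 |
7 | 0.1% | 29.22 | 21.69 | 18.77 | 17.19 | 16.21 |  |  |  | 12.73 | 11.69
  | 0.5% | 16.24 | 12.40 | 10.88 | 10.05 | 9.52 |  |  |  | 7.65 | 7.08
  | 1% | 12.25 | 9.55 | 8.45 | 7.85 | 7.46 |  |  |  | 6.07 | 5.65
  | 2.5% | 8.07 | 6.54 | 5.89 | 5.52 | 5.29 | 5.12 | 4.90 |  | 4.42 | 4.14
  | 5% | 5.59 | 4.74 | 4.35 | 4.12 | 3.97 | 3.87 | 3.73 |  | 3.41 | 3.23
  | 10% | 3.59 | 3.26 | 3.07 | 2.96 | 2.88 | 2.83 | 2.75 |  | 2.58 | 2.47
  | 20% | 2.00 | 2.04 | 2.02 | 1.99 | 1.97 | 1.96 | 1.93 |  | 1.87 | 1.83
8 | 0.1% | 25.42 | 18.49 | 15.83 | 14.39 | 13.49 | 12.86 | 12.04 |  | 10.30 | 9.34
  | 0.5% | 14.69 | 11.04 | 9.60 | 8.81 | 8.30 | 7.95 | 7.50 |  | 6.50 | 5.95
  | 1% | 11.26 | 8.65 | 7.59 | 7.01 | 6.63 | 6.37 | 6.03 |  | 5.28 | 4.86
  | 2.5% | 7.57 | 6.06 | 5.42 | 5.05 | 4.82 | 4.65 | 4.43 |  | 3.95 | 3.67
  | 5% | 5.32 | 4.46 | 4.07 | 3.84 | 3.69 | 3.58 | 3.44 | 3.28 | 3.12 | 2.93
  | 10% | 3.46 | 3.11 | 2.92 | 2.81 | 2.73 | 2.67 | 2.59 | 2.50 | 2.40 | 2.29
  | 20% | 1.95 | 1.98 | 1.95 | 1.92 | 1.90 | 1.88 | 1.86 | 1.83 | 1.79 | 1.74
9 | 0.1% | 22.86 | 16.39 | 13.90 | 12.56 | 11.71 | 11.13 | 10.37 | 9.57 | 8.72 | 7.81
  | 0.5% | 13.61 | 10.11 | 8.72 | 7.96 | 7.47 | 7.13 | 6.69 | 6.23 | 5.73 | 5.19
  | 1% | 10.56 | 8.02 | 6.99 | 6.42 | 6.06 | 5.80 | 5.47 | 5.11 | 4.73 | 4.31
  | 2.5% | 7.21 | 5.71 | 5.08 | 4.72 | 4.48 | 4.32 | 4.10 | 3.87 | 3.61 | 3.33
  | 5% | 5.12 | 4.26 | 3.86 | 3.63 | 3.48 | 3.37 | 3.23 | 3.07 | 2.90 | 2.71
  | 10% | 3.36 | 3.01 | 2.81 | 2.69 | 2.61 | 2.55 | 2.47 | 2.38 | 2.28 | 2.16
  | 20% | 1.91 | 1.94 | 1.90 | 1.87 | 1.85 | 1.83 | 1.80 | 1.76 | 1.72 | 1.67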


Table 3 is abridged from Table V of R. A. Fisher and F. Yates, Statistical Tables for Biological, Agricultural, and
Medical Research, published by Oliver and Boyd Ltd., by permission of the authors and publishers. The 0.5%
and 2.5% points are reprinted by permission from Biometrika, vol. 33 (April, 1943), pp. 73-88 ("Tables of
Percentage Points of the Inverted Beta (F) Distribution").


TABLE 3 (cont.)

[Table 3, continued (pages 42-44): the tabled values for the remaining degrees of freedom are not recoverable.]
STUDY EXERCISES

1. The table below describes a normal population of z-scores. For any
z-score in the body of the table, the corresponding numbers in the left and top
margins represent the tens and units digits, respectively, in the percentile rank
of the z-score. For example, the z-score whose percentile rank is 72 in this
normal population is +0.58.

Tens digit | Units digit: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
0 |  | -2.33 | -2.05 | -1.88 | -1.75 | -1.64 | -1.55 | -1.48 | -1.41 | -1.34
1 | -1.28 | -1.23 | -1.18 | -1.13 | -1.08 | -1.04 | -0.99 | -0.95 | -0.92 | -0.88
2 | -0.84 | -0.81 | -0.77 | -0.74 | -0.71 | -0.67 | -0.64 | -0.61 | -0.58 | -0.55
3 | -0.52 | -0.50 | -0.47 | -0.44 | -0.41 | -0.39 | -0.36 | -0.33 | -0.31 | -0.28
4 | -0.25 | -0.23 | -0.20 | -0.18 | -0.15 | -0.13 | -0.10 | -0.08 | -0.05 | -0.03
5 | 0.00 | 0.03 | 0.05 | 0.08 | 0.10 | 0.13 | 0.15 | 0.18 | 0.20 | 0.23
6 | 0.25 | 0.28 | 0.31 | 0.33 | 0.36 | 0.39 | 0.41 | 0.44 | 0.47 | 0.50
7 | 0.52 | 0.55 | 0.58 | 0.61 | 0.64 | 0.67 | 0.71 | 0.74 | 0.77 | 0.81
8 | 0.84 | 0.88 | 0.92 | 0.95 | 0.99 | 1.04 | 1.08 | 1.13 | 1.18 | 1.23
9 | 1.28 | 1.34 | 1.41 | 1.48 | 1.55 | 1.64 | 1.75 | 1.88 | 2.05 | 2.33

A random sample may be selected from this
population through the use of a table of random numbers, as follows: Make a
“blind” selection of a two-digit number from a table of random numbers
(see Appendix). Regard this number as a percentile rank in the z-score
distribution, then read the corresponding z-score from the table above. For
example, if the random number is 33 the corresponding z-score is -0.44. Continue
this process until a random sample of the desired size is drawn.

Following the procedure outlined above, draw a sample of four z-scores at 
random from the normal probability table. 


a) Compute $\chi^2$ for your sample, using formula (1).
b) Compute $\chi^2$ for your sample, using formula (3).

c) What is the probability, in any single instance of this kind, of obtaining a
value of $\chi^2$ larger than that obtained in (b) preceding?
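The sampling procedure of Exercise 1 can be sketched as follows. This is an illustration under stated assumptions rather than a substitute for the hand computation: the percentile-rank table is replaced by the inverse normal function of Python's standard library, random numbers run from 01 to 99 (rank 00 is left out here), and formulas (1) and (3) are taken, as in Exercise 2a, to use deviations from the population mean and from the sample mean respectively. All function names are hypothetical.

```python
import random
from statistics import NormalDist, mean


def z_from_percentile(rank):
    """z-score whose percentile rank is `rank` (1 to 99), to two decimals."""
    return round(NormalDist().inv_cdf(rank / 100), 2)


def draw_sample(size, rng):
    """Draw `size` z-scores by picking two-digit random numbers 01-99."""
    return [z_from_percentile(rng.randint(1, 99)) for _ in range(size)]


def chi2_from_population_mean(zs):
    """Deviations from the population mean; for z-scores mu = 0, sigma = 1."""
    return sum(z ** 2 for z in zs)


def chi2_from_sample_mean(zs):
    """Deviations from the sample mean; one degree of freedom is lost."""
    m = mean(zs)
    return sum((z - m) ** 2 for z in zs)


rng = random.Random(45)          # fixed seed so the draw can be repeated
sample = draw_sample(4, rng)
print("sample:", sample)
print("chi-square, population mean:", round(chi2_from_population_mean(sample), 2))
print("chi-square, sample mean:    ", round(chi2_from_sample_mean(sample), 2))
```

Repeating the draw many times and tabulating the results gives an empirical picture of the two $\chi^2$ distributions, with 4 and 3 degrees of freedom.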


2. The 12 z-scores in the upper part of the table below were selected at ran- 
dom from a table like the preceding in the manner described in Exercise 1 and 
were randomly assigned, three to each of four subsamples. The results of 
certain computations for each subsample are given below the z-scores ($\bar{z}$ represents
the mean of the three z-scores in the subsample).

                              Subsamples
                         1       2       3       4

                      -1.86     .46   -1.28   -1.91
                       -.42    -.75    2.28   -1.00
                       -.59    -.06    1.13    -.80

$\Sigma z$            -2.87    -.35    2.13   -3.71     $T = \Sigma z$ = -4.80
$\Sigma z^2$           3.98     .78    8.11    5.29     M = -4.80/12 = -.40
$(\Sigma z)^2/3$       2.75     .04    1.51    4.59
$\Sigma(z - \bar{z})^2$ 1.23    .74    6.60     .70


a) For the total sample of 12 cases, the value of $\chi^2$, taking deviations from
the population mean [formula (1)], is $\chi_1^2$ = 18.16. If the deviations are
taken from the mean of the 12 scores [formula (3)], $\chi_2^2$ = 16.24. What
number of degrees of freedom is associated with each of these $\chi^2$'s? Are
these $\chi^2$'s independent? Explain.


¹ A worth-while exercise for a class of N students is to have each student compute
100/N $\chi^2$'s in this fashion, all for samples of 4 cases each, and then to tabulate the
distribution of these 100 $\chi^2$'s to show empirically what a $\chi^2$ distribution means.


b) Compute a $\chi^2$ for each subsample separately, taking deviations from the
population mean. How many degrees of freedom for each? How are
these subsample $\chi^2$'s related to $\chi_1^2$?


c) Compute a $\chi^2$ for each subsample separately, in each case taking
deviations from the subsample mean. The sum of these $\chi^2$'s is distributed as
$\chi^2$ with how many degrees of freedom?


d) Regarding the four subsample means as a random sample from a
population of such means (each based on three scores), obtain the $\chi^2$ associated
with the deviations of the subsample means from the population mean.
Relate this $\chi^2$ and the $\chi^2$'s obtained in (c) to $\chi_1^2$.


e) Compute the $\chi^2$ associated with the deviation of the subsample means
from the sample mean (M). How does this relate to those already given
or obtained?


f) Compute the $\chi^2$ associated with the deviation of the sample mean (M)
from the population mean, and relate it to the previous $\chi^2$'s.


3. In 2b we have obtained four independent values of $\chi^2$. Form the ratio of
the first to the second. Show that this is an “F” ratio. Determine the relative
frequency with which this ratio would be exceeded in a very large number
of pairs of independent samples of 3 cases each drawn at random from this
population, if in every case the ratio were formed by placing the larger of the
two $\chi^2$'s in the numerator.


4. How can an F-ratio be formed using the $\chi^2$'s obtained in 2c and 2e? How
frequently is this result exceeded by chance?


5. A z-score is drawn at random from a normal population of such scores
and its square is recorded. This process is repeated a very large number of
times and a frequency distribution made of the resulting squared z-scores.
What will be the mode of this distribution? What will be the median? How
does the distribution relate to $\chi^2$?


6. Show that F with 1 and k degrees of freedom is equal to the square of t
with k degrees of freedom. Check the tables to verify this relationship.


The Simple-Randomized Design 


The Importance of the Simple-Randomized Design 


The importance of the simple-randomized design in experimental work 
cannot be overemphasized. Not only is the design widely employed by itself, 
but it constitutes a basic unit in nearly all of the more complex designs 
employed in experimental research. The treatments × levels design, for
example, may be regarded as consisting of a number of “replications” or 
repetitions of the simple-randomized design, one for each level of the con- 
trol variable. The random replications design likewise consists of a number 
of simple-randomized experiments, one performed for each of a number of 
subpopulations randomly selected from all those constituting the parent 
population. A factorial experiment, again, may be regarded as consisting 
of a number of simple-randomized experiments concerning one of the factors, 
each experiment performed for a different level (or combination of levels) 
of the other factor (or factors). 

It is therefore extremely important that the student achieve a thorough 
mastery of the simple-randomized design. Accordingly, a special effort has 
been made in this chapter to make all of the essential mathematical theory 
readily accessible to the student not trained in mathematics, to anticipate 
and to answer as many as possible of the questions that might occur to 
him, and to smooth out possible difficulties by liberal illustration and explana- 
tion. The chapter is to be regarded not as an independent discussion of a 
particular design, but as an essential introduction to the succeeding chap- 
ters. Until the student achieves a thorough understanding of this chapter, 
he cannot hope to grasp the full implications of the chapters to follow. 


The Hypothesis to be Tested 


The simple-randomized design is that in which each treatment is inde- 
pendently administered to a different group of subjects, all groups having 
been originally drawn at random from the same parent population. After 
the treatments have been administered these groups may be regarded as
random samples from a single population only if the treatments all have
identical effects on the distribution of criterion measures for the population.
Otherwise the group receiving Treatment $A_1$ may be regarded as a random
sample from an imaginary or hypothetical population which is like the
parent population except that all its members have now received Treatment
$A_1$. The sample that received Treatment $A_2$ may, likewise, be regarded as
a random sample from a population like the original, except that all members
of this population have received $A_2$, etc. We must therefore think of a number
(a) of different hypothetical populations, each of which may be regarded
as generated from the parent population by administering the given treat- 
ment to all of its members. We will hereafter refer to these hypothetical 
populations as the "treatment populations." The hypothesis we wish to 
test is that the criterion means of these treatment populations are identical. 
In the subsequent discussion, we will refer to this hypothesis as the “over-all 
null hypothesis" to emphasize the fact that it is concerned simultaneously 
with all of the treatments, as contrasted with more specific null hypotheses 
concerned only with single pairs of treatments. 


Limitations of the Simple t-Test 


We have seen (page 13) that in a simple-randomized experiment involving 
only two treatments, a t-test can be employed to test the null hypothesis 
concerning the treatment population means. Suppose, however, that several 
treatments, say four, are involved. To simplify the illustration, we will 
suppose also that the number of cases is the same in each group. We are now 
interested in the over-all null hypothesis that all four treatments are equally 
effective, or in the possibility that all observed differences among treatment 
means are due to sampling fluctuations. It might seem that we could test 
this hypothesis by applying the t-test successively to each of the six possible 
differences between two treatment means. Presumably, if any one of these 
differences proved to be significant, the over-all null hypothesis would have 
to be rejected; if none proved to be significant, the hypothesis could be 
retained. Obviously, the one of the six differences most likely to be sig- 
nificant is the largest. The simplest way to test the over-all null hypothesis 
would then seem to be to select the largest observed difference and apply 
a simple t-test to it. This, however, is not a valid procedure. 

In applying a simple t-test to a difference between the means of two independent
random samples, we read from the table for t the probability that
the observed value of t would be exceeded in any single randomly selected
instance if the null hypothesis were true. The probability that a single
randomly selected difference will exceed a given value, however, is by no
means the same as the probability that the largest of a number of randomly
selected differences will exceed this value. This should be obvious from a
simple analogy. Suppose that a single card is selected at random from a
deck of playing cards. The probability that the card selected will be an
ace is 1/13 or .07+. Now suppose that five cards are drawn at random, one
at a time, from the complete deck, each card being replaced before the next
is drawn. The probability that at least one of these cards will be an ace
is of course very much higher than .07. Actually it is .33−.

The correct probability is computed as follows: The probability that any
given card dealt will be an ace is 1/13. The probability that it will be other
than an ace is 12/13. The probability that both of two cards selected at random
will be other than an ace is 12/13 × 12/13. This follows from the fundamental law
of probability which states that the probability that two independent events
will occur together is equal to the product of their separate probabilities.
Similarly, the probability that all the five cards selected at random will be
other than an ace is $(12/13)^5$ = .6702. Accordingly, the probability that one or
more of the cards will be an ace is 1 − .6702 = .3298.
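The arithmetic of the card analogy can be reproduced in a few lines. The sketch below uses exact fractions; the function name is an illustrative choice.

```python
from fractions import Fraction


def prob_at_least_one(p_single, draws):
    """Probability that an event of probability `p_single` occurs at least
    once in `draws` independent trials (cards replaced after each draw)."""
    return 1 - (1 - p_single) ** draws


p_ace = Fraction(4, 52)                # four aces in 52 cards, i.e. 1/13
p_any_ace = prob_at_least_one(p_ace, 5)

print(f"single draw:        {float(p_ace):.4f}")
print(f"at least one in 5:  {float(p_any_ace):.4f}")
```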

In any simple-randomized design, in order to use the t-ratio of the largest 
observed difference among treatment means as a basis for testing the over-all 
null hypothesis, we would need a special table for t showing the probability 
that the largest difference among the means of the given number of random 
samples of the given sizes would exceed the observed value. Thus we would 
need a different t-table for each of the almost countless combinations of number
of treatments and size of sample, noting that in the general case, the
size of each treatment group may differ from that of every other. Even 
though we had such tables, however, this type of test would be unsatisfactory, 
since it is concerned only with one (the largest) of the observed differences 
and ignores the information contained in the other differences. Fortunately, 
this problem has a much better solution, which will be explained in the 
following sections. 
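The invalidity of testing only the largest difference can be seen in a small simulation. The sketch below rests on assumptions chosen for illustration: four groups of 10 cases each, all drawn from one normal population (so the over-all null hypothesis is true), an ordinary pooled-variance t for each pair, and 2.101 as the two-tailed 5% point of t for 18 degrees of freedom. Function names, trial count, and seed are hypothetical.

```python
import random
from itertools import combinations
from statistics import mean, variance


def t_ratio(x, y):
    """Simple t for two independent samples of equal size (pooled variance)."""
    n = len(x)
    pooled = (variance(x) + variance(y)) / 2
    return (mean(x) - mean(y)) / (2 * pooled / n) ** 0.5


def largest_abs_t(groups):
    """Largest absolute t among all possible pairs of groups."""
    return max(abs(t_ratio(x, y)) for x, y in combinations(groups, 2))


def rate_largest_exceeds(critical_t, n_groups=4, n=10, trials=4000, seed=49):
    """Proportion of experiments, under a true null hypothesis, in which
    the largest of the pairwise t's exceeds `critical_t`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        groups = [[rng.gauss(0.0, 1.0) for _ in range(n)]
                  for _ in range(n_groups)]
        hits += largest_abs_t(groups) > critical_t
    return hits / trials


rate = rate_largest_exceeds(2.101)
print(f"largest of six t's exceeds the 5% point in {rate:.1%} of experiments")
```

The printed rate should come out well above the nominal 5%, which is the point of the card analogy.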


The Test of the Over-all Null Hypothesis 


The Steps in Testing an Hypothesis: In testing an exact hypothesis about 
a population on the basis of data obtained from random samples drawn from 
that population, the generally accepted procedure is as follows: (1) Obtain a 
measure of discrepancy between the hypothesis and the sample observations. 
(2) Determine the sampling distribution this measure of discrepancy would 
have if the hypothesis were true. (3) Decide what risk we are willing to 
take of rejecting a true hypothesis. (4) In accordance with this decision, 
mark off regions of rejection and acceptance in the hypothetical sampling 
distribution. (5) Either reject the hypothesis if the observed discrepancy
falls in the region of rejection or retain the hypothesis if the discrepancy 
falls in the region of acceptance. We shall now follow this procedure in 
testing the over-all null hypothesis for a simple-randomized experiment. 
(If the student is not already familiar with these steps, they should become 
clear as he works through the following discussion.) 

The Measure of Discrepancy: Before defining our measure of discrepancy,
we may note that if all treatments have identical effects on the distribution
of criterion measures for the parent population, that is, if the distribution
of criterion measures is the same for all treatment populations, these
populations may be regarded as just one population. In this case the various
treatment groups may be all regarded as simple random samples from the
same population, whose variance we shall denote as $\sigma^2$. We shall see that
from the experimental data we can derive two independent and unbiased
estimates of $\sigma^2$: one estimate based on the differences among the observed
treatment means, the other upon the variance of the measures within the
individual treatment groups. We can then form the ratio of the first of
these estimates to the second. If the treatments have identical effects, the
first of these estimates will exceed the other only by chance. If the hypothesis
is false, that is, if the treatments really differ in effectiveness, then the
differences among the observed treatment means will be larger than they would
otherwise be, as will the estimate of $\sigma^2$ derived from them, so that the ratio
will then tend to be larger than 1.00. The greater the differences among
the observed treatment means, the larger this ratio will be. Accordingly,
we can use the ratio as a measure of the discrepancy between hypothesis
and observation, and if we can discover the sampling distribution of this
ratio, we can use it in a statistical test of the hypothesis.

Our first step will be to derive an estimate of $\sigma^2$ from the observed
treatment means. We shall consider first the case in which the number of cases
(n) is the same for all treatment groups. If the treatments have identical
effects on the distribution of criterion measures for the parent population,
the various treatment groups are all random samples from the same population,
and the means of these treatment groups are a simple random sample
of a cases from a population of such means. We shall denote the variance of
this population distribution of means as $\sigma_M^2$. We can then secure an
unbiased estimate of $\sigma_M^2$ as follows:

$$\text{est'd } \sigma_M^2 = \frac{\sum_{j=1}^{a} (M_j - M)^2}{a - 1}$$

in which $M_j$ is the mean of the jth treatment group and M is the general
mean for all groups $\left(M = \sum_{j=1}^{a} M_j / a\right)$. We know, however, that
$\sigma_M^2 = \sigma^2/n$. Hence, by substituting in the preceding expression, we get

$$\text{est'd } \frac{\sigma^2}{n} = \frac{\sum_{j=1}^{a} (M_j - M)^2}{a - 1}$$

or

$$\text{est}_1\, \sigma^2 = \frac{n \sum_{j=1}^{a} (M_j - M)^2}{a - 1}$$

Thus we have a way of estimating $\sigma^2$ from the obtained means of the
treatment groups. We shall denote this estimate as $\text{est}_1\, \sigma^2$, to distinguish it from
the estimate next to be secured.

We next note again that if the treatments have identical effects on the
criterion distribution, each treatment group is a simple random sample from
the same population as any other, so that for any treatment group (j), we can
secure an unbiased estimate of $\sigma^2$ as follows:

$$\text{est'd } \sigma^2 = \frac{\sum^{n} (X - M_j)^2}{n - 1}$$

However, we can secure a better estimate by averaging these estimates for all
of the a treatment groups. That is,

$$\text{est}_2\, \sigma^2 = \frac{1}{a} \sum_{j=1}^{a} \frac{\sum^{n} (X - M_j)^2}{n - 1} = \frac{\sum_{j=1}^{a} \sum^{n} (X - M_j)^2}{a(n - 1)}$$

We thus have two estimates of $\sigma^2$, one obtained from the treatment means,
the other from the individual measures within the individual treatment
groups. We can then form a ratio, $\text{est}_1\, \sigma^2 / \text{est}_2\, \sigma^2$, between these two estimates.
If the treatments have identical effects, we would expect this ratio to exceed
1.00 only by chance. If the null hypothesis is false, we would expect the
ratio to be systematically larger than 1.00. Clearly, therefore, this ratio
measures the discrepancy between our over-all null hypothesis and the
experimental observations.
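The two estimates and their ratio can be computed directly from their definitions. The following is a minimal sketch for the equal-n case, using hypothetical criterion scores invented for illustration (a = 3 treatments, n = 4 subjects each); the function name is likewise illustrative.

```python
from statistics import mean, variance


def two_estimates(groups):
    """Return (est1, est2, ratio) for treatment groups of equal size n."""
    a = len(groups)
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes the same n in every group")
    group_means = [mean(g) for g in groups]
    # est1: n times the variance estimate computed from the a group means.
    est1 = n * variance(group_means)
    # est2: the average of the a within-group variance estimates.
    est2 = sum(variance(g) for g in groups) / a
    return est1, est2, est1 / est2


# Hypothetical scores, invented for illustration only.
scores = [
    [4, 6, 5, 5],   # Treatment A1
    [7, 9, 8, 8],   # Treatment A2
    [5, 6, 4, 5],   # Treatment A3
]

est1, est2, ratio = two_estimates(scores)
print(f"est1 = {est1:.2f}, est2 = {est2:.2f}, ratio = {ratio:.2f}")
```

In these invented data the group means differ far more than the scatter within groups would lead one to expect, so the ratio is much larger than 1.00.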

The Sampling Distribution of the Measure of Discrepancy: To determine 
the sampling distribution of this measure of discrepancy, we must first make 
certain basic assumptions. These are 


1) The experimental treatment groups were originally drawn at random
from the same parent population.¹ We have noted that, after
administration of the treatments, each group may be regarded as a
simple random sample from a different “treatment population.”


2) The variance ($\sigma^2$) of the distribution of criterion measures is the same
for each of these treatment populations.


3) The form of each of these distributions is normal.


¹ On the basis of these three assumptions, we shall show that our measure of
discrepancy is distributed as F (see pages 52-53) if the null hypothesis is true. Strictly, while
these assumptions are together sufficient, the first is not entirely necessary for this
purpose alone. That is, the measure of discrepancy is distributed as F if each treatment
group is drawn at random from its own “treatment population” (and if assumptions 2
and 3 are satisfied also), whether or not the treatment groups are alike before
administration of the treatments. Our real interest, however, is in the hypothesis of equal
treatment effects, not just in the hypothesis of equal criterion means for the populations
following administration of the treatments. Equal criterion means at the close of the
experiment imply equal treatment effects only if we can assume that the populations
were alike before the treatments were administered. Accordingly, we have specified
more than is really necessary for an F-distribution alone, and shall follow a similar
practice with the F-tests used in later designs.


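Our measure of discrepancy is

$$\frac{\text{est}_1\, \sigma^2}{\text{est}_2\, \sigma^2} = \frac{n \sum_{j=1}^{a} (M_j - M)^2 \big/ (a - 1)}{\sum_{j=1}^{a} \sum^{n} (X - M_j)^2 \big/ (N - a)} \qquad (5)$$

in which N = na is the total number of cases.

The usefulness of this form of the ratio will become apparent later. If
we divide both numerator and denominator of (5) by $\sigma^2$, we may rewrite it
as follows:

$$\frac{\dfrac{n \sum_{j=1}^{a} (M_j - M)^2}{\sigma^2} \Big/ (a - 1)}{\dfrac{\sum_{j=1}^{a} \sum^{n} (X - M_j)^2}{\sigma^2} \Big/ (N - a)} \qquad (6)$$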


We note that $\sum_{j=1}^{a} (M_j - M)^2$ represents the sum of the squared deviations
from the mean in a sample of a cases, each “case” in this instance being a
group mean. We note also that if our hypothesis is true and our basic
assumptions are satisfied, all a groups are drawn at random from the same
population. In that case, $\sigma^2/n$ is the population variance of these group means,
since each group mean is the mean of a random sample of n measures.
Accordingly, $n \sum_{j=1}^{a} (M_j - M)^2 / \sigma^2$ is, on the assumption of normality and by the
proposition (3) stated on page 31, distributed as $\chi^2$ with a − 1 degrees of
freedom.

In the denominator of this ratio, we note that by assumptions 1 and 3
the term $\sum^{n} (X - M_j)^2 / \sigma^2$ is also distributed as $\chi^2$, this time with n − 1 degrees
of freedom. Accordingly, the sum, $\sum_{j=1}^{a} \sum^{n} (X - M_j)^2 / \sigma^2$, of a such terms must be
distributed as $\chi^2$ with $\sum_{j=1}^{a} (n - 1) = a(n - 1) = N - a$ degrees of freedom.
(It should be apparent that this is true whether n is variable or constant.)

Thus, if our hypothesis is true, the ratio between the two estimates of the
population variance (6) may be regarded as a ratio between two values of
$\chi^2$, each divided by its own degrees of freedom. If we can show that these
$\chi^2$'s are independent, it will follow that this ratio is distributed as F, again
granting that the hypothesis is true.

We have already observed (page 35) that the mean and variance of a simple
random sample drawn from a normal population are independent of one
another. If our hypothesis is true, it follows from this, for any single
treatment group, that $n(M_j - M)^2$ is independent of $\sum^{n} (X - M_j)^2$. The sums of
these two expressions for all treatment groups must also be independent of
one another. That is, $\sum_{j=1}^{a} n(M_j - M)^2$ is independent of $\sum_{j=1}^{a} \sum^{n} (X - M_j)^2$, and
the $\chi^2$ in the numerator of (6) is independent of that in the denominator.
Accordingly, under a true hypothesis, our measure of discrepancy is distributed
as F. That is, we may write

$$F = \frac{n \sum_{j=1}^{a} (M_j - M)^2 \big/ (a - 1)}{\sum_{j=1}^{a} \sum^{n} (X - M_j)^2 \big/ (N - a)}$$


The preceding proof that under a true hypothesis the ratio of the two
estimates of the population variance is distributed as F applies only when the
number of cases is the same for all treatment groups. However, it may
readily be shown¹ that this ratio is still distributed as F when n is not
constant. In its more general form, the ratio is written

$$F = \frac{\sum_{j=1}^{a} n_j (M_j - M)^2 \big/ (a - 1)}{\sum_{j=1}^{a} \sum^{n_j} (X - M_j)^2 \big/ (N - a)}, \qquad df = (a - 1)/(N - a), \qquad (7)$$

in which $n_j$ is variable.


¹ To prove that $\sum_{j=1}^{a} n_j (M_j - M)^2 / \sigma^2$ is distributed as $\chi^2$ with a − 1 df,
let $\mu$ = population mean; then

$$\frac{(M_j - \mu)^2}{\sigma^2 / n_j}$$

is distributed as $\chi^2$ with 1 df. Hence,

$$\frac{(M_1 - \mu)^2}{\sigma^2 / n_1} + \frac{(M_2 - \mu)^2}{\sigma^2 / n_2} + \cdots + \frac{(M_a - \mu)^2}{\sigma^2 / n_a} = \frac{\sum_{j=1}^{a} n_j (M_j - \mu)^2}{\sigma^2}$$

is distributed as $\chi^2$ with a df.
Now, if we impose the restriction on the preceding expression that the deviations
be computed from the sample mean (M) rather than from the population mean ($\mu$),
one degree of freedom is lost, and $\sum_{j=1}^{a} n_j (M_j - M)^2 / \sigma^2$ is distributed as $\chi^2$
with a − 1 df. The proof that $\sum_{j=1}^{a} \sum^{n_j} (X - M_j)^2 / \sigma^2$ is distributed as $\chi^2$ has
already been given, hence the ratio in (7) is one between two $\chi^2$'s each divided
by its degrees of freedom, and is distributed as F.


The Measure of Discrepancy as a “Mean Square” Ratio


For convenience we shall henceforth let $\sum_{j=1}^{a} \sum^{n_j} (X - M_j)^2$ be represented by
$ss_w$, and will refer to it as “the sum of squares for within-groups.” This is
an abbreviation for “the sum of the squared deviations of the individual
measures from their respective group means.” We shall similarly refer to
$\sum_{j=1}^{a} n_j (M_j - M)^2 = ss_A$ as “the sum of squares for treatments,” which stands
for “the sum of the weighted squared deviations of the individual treatment
means from the general mean.” The degrees of freedom for $ss_A$ and $ss_w$ are
(a − 1) and (N − a), respectively. For convenience also, we shall refer to
the numerator of the ratio in (7) as the “mean square” for treatments ($ms_A$)
and to the denominator as the “mean square” for within-groups ($ms_w$). This
notation may be summarized as follows:

$$ss_A = \sum_{j=1}^{a} n_j (M_j - M)^2$$

$$ss_w = \sum_{j=1}^{a} \sum^{n_j} (X - M_j)^2$$

$$df_A = (a - 1)$$

$$df_w = (N - a)$$

$$ms_A = \frac{\sum_{j=1}^{a} n_j (M_j - M)^2}{a - 1} = \frac{ss_A}{df_A}$$

$$ms_w = \frac{\sum_{j=1}^{a} \sum^{n_j} (X - M_j)^2}{N - a} = \frac{ss_w}{df_w}$$

In terms of this notation, our measure of discrepancy is given by

$$F = \frac{ms_A}{ms_w}, \qquad df = (a - 1)/(N - a),$$

and may be referred to as the “mean square ratio” for treatments and
within-groups.

Computational Procedures: In order to apply this F-test in an actual experi- 
ment, we must compute ssa and ss». We shall now derive more convenient 
formulas for the computation of these terms. 

If we let the totals for a single group and for all groups collectively be
represented by $T_j = \sum^{n_j} X$ and $T = \sum_{j=1}^{a}\sum^{n_j} X$, respectively, we may write

$$ss_A = \sum_{j=1}^{a} n_j (M_j - M)^2 = \sum_{j=1}^{a} n_j \left(\frac{T_j}{n_j} - \frac{T}{N}\right)^2.$$

This reduces to

$$ss_A = \sum_{j=1}^{a} \frac{T_j^2}{n_j} - \frac{T^2}{N}, \tag{8}$$

which represents in general the most convenient way of computing $ss_A$.
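The reduction to (8) is stated without its intermediate steps. A sketch of the algebra, using only the definitions $M_j = T_j/n_j$ and $M = T/N$ together with $\sum T_j = T$ and $\sum n_j = N$, is:

```latex
\begin{aligned}
\sum_{j=1}^{a} n_j\left(\frac{T_j}{n_j} - \frac{T}{N}\right)^2
  &= \sum_{j=1}^{a} \frac{T_j^2}{n_j}
     - \frac{2T}{N}\sum_{j=1}^{a} T_j
     + \frac{T^2}{N^2}\sum_{j=1}^{a} n_j \\
  &= \sum_{j=1}^{a} \frac{T_j^2}{n_j} - \frac{2T^2}{N} + \frac{T^2}{N} \\
  &= \sum_{j=1}^{a} \frac{T_j^2}{n_j} - \frac{T^2}{N}.
\end{aligned}
```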
To derive a computational formula (10) for $ss_w$, we note next that for a
single measure in group $j$

$$(X - M) = (X - M_j) + (M_j - M).$$

Squaring both sides of this equality, we have

$$(X - M)^2 = (X - M_j)^2 + 2(X - M_j)(M_j - M) + (M_j - M)^2.$$

If we now sum such expressions for all the $n_j$ measures in group $j$, we get

$$\sum^{n_j}(X - M)^2 = \sum^{n_j}(X - M_j)^2 + 2(M_j - M)\sum^{n_j}(X - M_j) + n_j(M_j - M)^2.$$

Since $\sum^{n_j}(X - M_j) = 0$, this reduces to

$$\sum^{n_j}(X - M)^2 = \sum^{n_j}(X - M_j)^2 + n_j(M_j - M)^2.$$

Summing such expressions for all $a$ groups, we get

$$\sum_{j=1}^{a}\sum^{n_j}(X - M)^2 = \sum_{j=1}^{a}\sum^{n_j}(X - M_j)^2 + \sum_{j=1}^{a} n_j(M_j - M)^2.$$

We will call $\sum_{j=1}^{a}\sum^{n_j}(X - M)^2$ the “total sum of squares,” meaning “the sum
of the squared deviations from the general mean of all measures in all groups,”
and will let it be represented by $ss_T$. We may then write $ss_T = ss_w + ss_A$.
Thus we see that the sum of the squared deviations of the individual
measures from the general mean for all treatment groups may be analyzed
into the two components needed in the two terms we wish to compute for
our F-test.

It is much easier to compute $ss_T$ directly by means of

$$ss_T = \sum_{j=1}^{a}\sum^{n_j} X^2 - \frac{T^2}{N} \tag{9}$$

than it is to compute $ss_w$ directly. Hence, the easiest way to secure $ss_w$ is
first to compute $ss_T$ and $ss_A$, and then to subtract the latter from the former.
That is, $ss_w$ is best computed as a residual by

$$ss_w = ss_T - ss_A. \tag{10}$$
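The partition $ss_T = ss_w + ss_A$ and the computational formulas (8), (9) and (10) can be checked numerically. The following is a minimal sketch; the scores are hypothetical (they are not data from the text), and the function names are chosen here only for illustration.

```python
# Hypothetical scores for three treatment groups (not data from the text).
groups = [[3, 5, 4, 6], [7, 9, 8], [2, 1, 3, 2, 2]]


def sums_of_squares(groups):
    """Return (ss_A, ss_w, ss_T) from the computational formulas (8), (9), (10)."""
    N = sum(len(g) for g in groups)
    T = sum(sum(g) for g in groups)
    correction = T ** 2 / N                                        # T^2 / N
    ss_A = sum(sum(g) ** 2 / len(g) for g in groups) - correction  # formula (8)
    ss_T = sum(x ** 2 for g in groups for x in g) - correction     # formula (9)
    ss_w = ss_T - ss_A                                             # formula (10)
    return ss_A, ss_w, ss_T


def sums_of_squares_by_definition(groups):
    """Return (ss_A, ss_w, ss_T) directly from the squared deviations."""
    N = sum(len(g) for g in groups)
    M = sum(sum(g) for g in groups) / N            # general mean
    means = [sum(g) / len(g) for g in groups]      # group means M_j
    ss_A = sum(len(g) * (m - M) ** 2 for g, m in zip(groups, means))
    ss_w = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ss_T = sum((x - M) ** 2 for g in groups for x in g)
    return ss_A, ss_w, ss_T


shortcut = sums_of_squares(groups)
direct = sums_of_squares_by_definition(groups)
print(shortcut)
print(direct)
```

Both routes should agree up to floating-point rounding, and in each the total equals the sum of the two components.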

The results of the computations may be summarized in a table as follows:

Summary Table

Source of Variation    df       Sum of Squares                                      Mean Square
Treatments (A)         a − 1    $ss_A = \sum_{j=1}^{a} T_j^2/n_j - T^2/N$            $ms_A = ss_A/(a - 1)$
Within-groups (w)      N − a    $ss_w = ss_T - ss_A$                                 $ms_w = ss_w/(N - a)$
Total                  N − 1    $ss_T = \sum_{j=1}^{a}\sum^{n_j} X^2 - T^2/N$


We begin the computation by first securing $\sum^{n_j} X = T_j$ and $\sum^{n_j} X^2$ for each
individual treatment group. Given an adding machine and a table of squares,
we can enter each X in turn on the left side of the keyboard, looking up the
square of this value and entering it at the same time on the right side of
the keyboard. Thus, $\sum X$ and $\sum X^2$ may be obtained simultaneously.
(This can be done more conveniently, without the use of a table of squares,
if an automatic calculating machine is available.) We then add the $T_j$'s for
the various groups to secure $T$. Each $T_j$ is then squared and divided by its
corresponding $n_j$; these values are summed and $T^2/N$ is subtracted from their
sum to yield the sum of squares for treatments. The sums of the squared
X's are then totalled for all treatment groups to give $\sum_{j=1}^{a}\sum^{n_j} X^2$, from which
$T^2/N$ is subtracted to yield the total sum of squares. The sum of squares for
within-groups is then secured as a residual. These steps will be made clearer
by the worked example which follows.




A Worked Example: Suppose that a certain experiment involves four treatments,
$A_1$, $A_2$, $A_3$, and $A_4$, administered to independent samples of 5, 4, 6
and 4 subjects, respectively. The criterion measures secured at the close of
the experiment are given in the table below.


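Computational Example

                  Treatment   Treatment   Treatment   Treatment
                    $A_1$       $A_2$       $A_3$       $A_4$

[Individual criterion scores are not recoverable from this copy; only the summary rows follow.]

$n_j$                  5           4           6           4        N = 19
$T_j$                 23          29          12          16        T = 80
$\sum X^2$           119         225          30          66        $\sum\sum X^2$ = 440
$M_j$               4.60        7.25        2.00        4.00        M = 4.21
$T_j^2$              529         841         144         256
$T_j^2/n_j$       105.80      210.25       24.00       64.00        $\sum T_j^2/n_j$ = 404.05

$$ss_A = \sum_{j=1}^{a} T_j^2/n_j - T^2/N = 404.05 - 336.84 = 67.21$$

$$ss_T = \sum_{j=1}^{a}\sum^{n_j} X^2 - T^2/N = 440 - 336.84 = 103.16$$

$$ss_w = ss_T - ss_A = 103.16 - 67.21 = 35.95$$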


The results of the first computational steps are also given. It is not
necessary to secure and record $\sum^{n_j} X^2$ separately for each treatment group, but
ordinarily this is done since it is desirable to check the work in small units.
Then if an error is found the computation need be repeated only for the
particular group involved. The importance of checking each step carefully
before going on to subsequent steps cannot be overemphasized. Formulas
(8) and (9) are used to compute the sums of squares for treatments and for
total, and the sum of squares for within-groups is then secured as a residual.
The results are summarized in the table below:


Source of Variation     df      ss        ms
Treatments (A)           3     67.21     22.40
Within-groups (w)       15     35.95      2.40
Total                   18    103.16

$$F = \frac{22.40}{2.40} = 9.33$$
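The same arithmetic can be reproduced from the summary rows of the computational example. This is only a sketch: it starts from the group sizes, totals and sums of squares given in the table (the individual scores are not needed), and the variable names are illustrative.

```python
# Summary values taken from the computational example in the text.
n = [5, 4, 6, 4]               # group sizes n_j
T_j = [23, 29, 12, 16]         # group totals
sum_sq = [119, 225, 30, 66]    # sum of X^2 within each group

a = len(n)
N = sum(n)
T = sum(T_j)
correction = T ** 2 / N        # T^2 / N

ss_A = sum(t ** 2 / k for t, k in zip(T_j, n)) - correction   # formula (8)
ss_T = sum(sum_sq) - correction                               # formula (9)
ss_w = ss_T - ss_A                                            # formula (10)

ms_A = ss_A / (a - 1)
ms_w = ss_w / (N - a)
F = ms_A / ms_w

print(f"ss_A = {ss_A:.2f}, ss_w = {ss_w:.2f}, ss_T = {ss_T:.2f}")
print(f"ms_A = {ms_A:.2f}, ms_w = {ms_w:.2f}, F = {F:.2f}")
```

The sums of squares and mean squares agree with the table to two decimals. The ratio of the unrounded mean squares comes to about 9.35; the 9.33 in the table results from dividing the mean squares after rounding them to 22.40 and 2.40.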


In this computational example, non-significant digits have at times been
retained in the results. For instance, since $T_1 = 23$ contains only two
significant digits, there can be only two significant digits in $M_1$, which strictly
should be written 4.6 rather than 4.60. However, for simplicity in
computations of this kind [based on formulas like (8) and (9)], the rule will be
followed in this text of always carrying all intermediate results to two more
decimal places than are significant in an individual observation. Final results
used in the tests of significance, that is, the F's and t's, will be given
only to two decimal places, to be consistent with the entries in Tables 1
and 2. This means, as in the example already presented, that non-significant
digits will frequently be carried in the intermediate steps in the
computation, and that sometimes not all digits in the F's or t's reported will be
significant. However, this computational procedure is satisfactory for all
practical purposes and is simple to employ.

The Expected Value of $ms_w$: We have thus far completed the first two of
the steps involved in testing the null hypothesis. That is, we have defined
(and shown how to compute) our measure of discrepancy and have determined
its sampling distribution. Before going on to the remaining steps, we will
consider certain important properties of the mean squares for treatments
and within-groups. Specifically, we will demonstrate that $ms_w$ is an
unbiased estimate of the common variance ($\sigma^2$) of the treatment populations,
or that the “expected” value of $ms_w$ is $\sigma^2$. The “expected” value of a sample
statistic is defined as its average value for an infinite number of similar
(independent) samples. We will also derive an expression for the expected
value of $ms_A$. These relationships will add considerably to the
meaningfulness of our measure of discrepancy and of our test of significance, and will
provide us later with a basis for estimating the standard errors of individual
treatment means and for testing the significance of the differences for
individual pairs of means.

The derivations in this section and that following are entirely algebraic
in character and should be within the grasp of most students using this
text. However, the student who is not adept at algebra can perhaps afford
to skip these derivations and to take (11), (16) and (17) for granted. Most
students, however, should read the interpretive comments following these
derivations (pages 62-64).

We shall begin by showing that the expected value of $ms_w$ is $\sigma^2$, or that
for a single experiment, $ms_w$ is an unbiased estimate of $\sigma^2$. This has already
been shown (page 51) for the case in which $n_j = n$ is a constant, but we shall
now prove it for the general case in which $n_j$ is variable.

Suppose that a large number ($k$) of similar simple-randomized experiments
have been performed with the same experimental treatments, each
experiment having been performed under the same conditions with independent
random samples drawn from the same population. As before, we will let
$M_j$ represent the observed mean for the $j$th treatment in a single experiment,
$M$ the weighted mean of the $M_j$'s, $\mu_j$ the population mean for Treatment $j$,
and $\mu$ the weighted mean of the $\mu_j$'s. For any measure (X) in the $j$th
treatment group in a single experiment, we may then write the identity


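$$(X - \mu_j) = (X - M_j) + (M_j - \mu_j).$$

Squaring both sides of this expression,

$$(X - \mu_j)^2 = (X - M_j)^2 + 2(X - M_j)(M_j - \mu_j) + (M_j - \mu_j)^2.$$

Summing such expressions for the $n_j$ measures in the treatment group,

$$\sum^{n_j}(X - \mu_j)^2 = \sum^{n_j}(X - M_j)^2 + 2(M_j - \mu_j)\sum^{n_j}(X - M_j) + n_j(M_j - \mu_j)^2.$$

Transposing, and noting that $\sum^{n_j}(X - M_j) = 0$,

$$\sum^{n_j}(X - M_j)^2 = \sum^{n_j}(X - \mu_j)^2 - n_j(M_j - \mu_j)^2.$$

Summing such expressions for the $a$ treatment groups,

$$\sum_{j=1}^{a}\sum^{n_j}(X - M_j)^2 = \sum_{j=1}^{a}\sum^{n_j}(X - \mu_j)^2 - \sum_{j=1}^{a} n_j(M_j - \mu_j)^2.$$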


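Summing such expressions for all $k$ experiments and dividing by $k$,

$$
\begin{aligned}
\frac{1}{k}\sum^{k}\sum_{j=1}^{a}\sum^{n_j}(X - M_j)^2
  &= \frac{1}{k}\sum^{k}\sum_{j=1}^{a}\sum^{n_j}(X - \mu_j)^2 - \frac{1}{k}\sum^{k}\sum_{j=1}^{a} n_j(M_j - \mu_j)^2 \\
  &= \sum_{j=1}^{a} n_j\left[\frac{1}{k}\sum^{k}\frac{\sum^{n_j}(X - \mu_j)^2}{n_j}\right] - \sum_{j=1}^{a} n_j\left[\frac{1}{k}\sum^{k}(M_j - \mu_j)^2\right].
\end{aligned}
$$

Since $\frac{1}{k}\sum^{k}$ symbolizes the process of finding the mean of $k$ values, the
preceding expression may be read

$$\text{“mean of } \sum_{j=1}^{a}\sum^{n_j}(X - M_j)^2 \text{ for } k \text{ experiments”} = \sum_{j=1}^{a} n_j\left(\text{mean of } \frac{\sum^{n_j}(X - \mu_j)^2}{n_j}\right) - \sum_{j=1}^{a} n_j\left(\text{mean of } (M_j - \mu_j)^2\right).$$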




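Now, letting $k$ become infinitely large ($k \to \infty$), this means that in the long
run the

$$\text{mean of } \sum_{j=1}^{a}\sum^{n_j}(X - M_j)^2 = \sum_{j=1}^{a} n_j\sigma^2 - \sum_{j=1}^{a} n_j\frac{\sigma^2}{n_j} = N\sigma^2 - a\sigma^2 = \sigma^2(N - a).$$

From this it follows that in the long run ($k \to \infty$) the

$$\text{mean of } \frac{\sum_{j=1}^{a}\sum^{n_j}(X - M_j)^2}{N - a}\ (= ms_w) = \sigma^2,$$

which is equivalent to saying that $ms_w$ is an unbiased estimate of $\sigma^2$. It is
also equivalent to saying that the “expected” value of $ms_w$ is $\sigma^2$. The symbol
$E$ $\left(= \frac{1}{k}\sum^{k} \text{ when } k \to \infty\right)$ stands for “expected value of.” Thus,

$$E(ms_w) = \sigma^2. \tag{11}$$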
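Result (11) can be illustrated by carrying out the "large number of similar experiments" on a computer. The sketch below makes several assumptions not in the text: the populations are taken to be normal, the population means and the value $\sigma = 2$ are arbitrary, $k$ is finite (20,000), and the function names are illustrative. The group sizes are unequal, as in the general case treated above.

```python
import random


def ms_within(groups):
    """Mean square for within-groups: sum of (X - M_j)^2 divided by (N - a)."""
    N = sum(len(g) for g in groups)
    a = len(groups)
    ss_w = 0.0
    for g in groups:
        m = sum(g) / len(g)
        ss_w += sum((x - m) ** 2 for x in g)
    return ss_w / (N - a)


def average_ms_within(k, sizes, mus, sigma, seed=0):
    """Average ms_w over k simulated experiments (normal populations assumed)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(k):
        groups = [[rng.gauss(mu, sigma) for _ in range(n)]
                  for n, mu in zip(sizes, mus)]
        total += ms_within(groups)
    return total / k


# Unequal group sizes and unequal population means; common variance 2.0 ** 2 = 4.
avg = average_ms_within(k=20000, sizes=[5, 4, 6, 4],
                        mus=[4.0, 7.0, 2.0, 4.0], sigma=2.0)
print(avg)
```

The printed average should lie close to 4, whatever the differences among the population means, since those differences do not enter into $ms_w$.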


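The Expected Value of $ms_A$: The expression for the expected value of $ms_A$
will be derived only for the case in which $n_j = n$ is a constant. We first write,
for a single treatment group in a single experiment,

$$(M_j - M) = (M_j - \mu) + (\mu - M) = (M_j - \mu) - (M - \mu).$$

Squaring both sides of this identity,

$$(M_j - M)^2 = (M_j - \mu)^2 + (M - \mu)^2 - 2(M_j - \mu)(M - \mu).$$

Multiplying both sides by $n_j$, and summing for the $a$ values of $j$, we get

$$\sum_{j=1}^{a} n_j(M_j - M)^2 = \sum_{j=1}^{a} n_j(M_j - \mu)^2 + \sum_{j=1}^{a} n_j(M - \mu)^2 - 2\sum_{j=1}^{a} n_j(M_j - \mu)(M - \mu),$$

in which, if we let $N = \sum_{j=1}^{a} n_j$,

$$\sum_{j=1}^{a} n_j(M - \mu)^2 = N(M - \mu)^2,$$

and

$$
\begin{aligned}
2\sum_{j=1}^{a} n_j(M_j - \mu)(M - \mu) &= 2(M - \mu)\sum_{j=1}^{a} n_j(M_j - \mu) \\
  &= 2(M - \mu)\left(\sum_{j=1}^{a} n_j M_j - \mu\sum_{j=1}^{a} n_j\right) \\
  &= 2(M - \mu)(NM - N\mu) \\
  &= 2N(M - \mu)^2.
\end{aligned}
$$

Therefore,

$$\sum_{j=1}^{a} n_j(M_j - M)^2 = \sum_{j=1}^{a} n_j(M_j - \mu)^2 - N(M - \mu)^2.$$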


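Summing these expressions for the $k$ experiments and dividing by $k$,

$$\frac{1}{k}\sum^{k}\sum_{j=1}^{a} n_j(M_j - M)^2 = \frac{1}{k}\sum^{k}\sum_{j=1}^{a} n_j(M_j - \mu)^2 - \frac{1}{k}\sum^{k} N(M - \mu)^2. \tag{12}$$

Now, for a single treatment group in a single experiment

$$
\begin{aligned}
n_j(M_j - \mu)^2 &= n_j\left[(M_j - \mu_j) + (\mu_j - \mu)\right]^2 \\
  &= n_j(M_j - \mu_j)^2 + n_j(\mu_j - \mu)^2 + 2n_j(M_j - \mu_j)(\mu_j - \mu).
\end{aligned}
$$

Summing these expressions for the $k$ experiments and dividing by $k$,

$$\frac{1}{k}\sum^{k} n_j(M_j - \mu)^2 = \frac{1}{k}\sum^{k} n_j(M_j - \mu_j)^2 + n_j(\mu_j - \mu)^2 + 2n_j(\mu_j - \mu)\,\frac{1}{k}\sum^{k}(M_j - \mu_j).$$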


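If we let $k \to \infty$, we note that in the preceding expression

$$\frac{1}{k}\sum^{k} n_j(M_j - \mu_j)^2 = n_j\sigma_{M_j}^2 = n_j\frac{\sigma^2}{n_j} = \sigma^2,$$

and

$$2n_j(\mu_j - \mu)\,\frac{1}{k}\sum^{k}(M_j - \mu_j) = 0.$$

Hence,

$$\frac{1}{k}\sum^{k} n_j(M_j - \mu)^2 = \sigma^2 + n_j(\mu_j - \mu)^2.$$

Accordingly, in (12), when $k \to \infty$,

$$
\begin{aligned}
\frac{1}{k}\sum^{k}\sum_{j=1}^{a} n_j(M_j - \mu)^2 &= \sum_{j=1}^{a}\frac{1}{k}\sum^{k} n_j(M_j - \mu)^2 \\
  &= \sum_{j=1}^{a}\left[\sigma^2 + n_j(\mu_j - \mu)^2\right] \\
  &= a\sigma^2 + \sum_{j=1}^{a} n_j(\mu_j - \mu)^2
\end{aligned}
\tag{13}
$$

and

$$\frac{1}{k}\sum^{k} N(M - \mu)^2 = N\sigma_M^2.$$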


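But if $n_j$ is constant,

$$M = \frac{1}{a}(M_1 + M_2 + \cdots + M_a).$$

Hence,

$$\sigma_M^2 = \frac{1}{a^2}\left(\sigma_{M_1}^2 + \sigma_{M_2}^2 + \cdots + \sigma_{M_a}^2\right) = \frac{1}{a^2}\cdot\frac{a\sigma^2}{n} = \frac{\sigma^2}{an} = \frac{\sigma^2}{N},$$

and

$$\frac{1}{k}\sum^{k} N(M - \mu)^2 = N\sigma_M^2 = N\frac{\sigma^2}{N} = \sigma^2. \tag{14}$$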


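Hence, substituting from (13) and (14) in (12),

$$
\begin{aligned}
\frac{1}{k}\sum^{k}\sum_{j=1}^{a} n_j(M_j - M)^2 &= a\sigma^2 + \sum_{j=1}^{a} n_j(\mu_j - \mu)^2 - \sigma^2 \\
  &= (a - 1)\sigma^2 + \sum_{j=1}^{a} n_j(\mu_j - \mu)^2.
\end{aligned}
$$

Dividing both sides by $(a - 1)$,

$$\frac{1}{k}\sum^{k}\frac{\sum_{j=1}^{a} n_j(M_j - M)^2}{a - 1} = \sigma^2 + \frac{\sum_{j=1}^{a} n_j(\mu_j - \mu)^2}{a - 1}. \tag{15}$$

Hence, the expected value of $ms_A$ is

$$E(ms_A) = \sigma^2 + \frac{\sum_{j=1}^{a} n_j(\mu_j - \mu)^2}{a - 1}. \tag{16}$$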
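As with (11), result (16) can be illustrated by simulation for the equal-$n$ case in which it was derived. The sketch below assumes normal populations, arbitrary illustrative values of the population means and of $\sigma$, and a finite number of experiments; none of these choices comes from the text.

```python
import random


def ms_treatments(groups):
    """Mean square for treatments: sum of n_j (M_j - M)^2 divided by (a - 1)."""
    N = sum(len(g) for g in groups)
    a = len(groups)
    M = sum(sum(g) for g in groups) / N
    ss_A = sum(len(g) * (sum(g) / len(g) - M) ** 2 for g in groups)
    return ss_A / (a - 1)


def expected_ms_treatments(n, mus, sigma):
    """Right-hand side of (16) for equal group sizes n."""
    a = len(mus)
    mu = sum(mus) / a
    return sigma ** 2 + n * sum((m - mu) ** 2 for m in mus) / (a - 1)


def average_ms_treatments(k, n, mus, sigma, seed=0):
    """Average ms_A over k simulated experiments (normal populations assumed)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(k):
        groups = [[rng.gauss(mu, sigma) for _ in range(n)] for mu in mus]
        total += ms_treatments(groups)
    return total / k


mus = [4.0, 5.0, 3.0, 4.0]     # hypothetical treatment population means
theory = expected_ms_treatments(n=5, mus=mus, sigma=2.0)
observed = average_ms_treatments(k=20000, n=5, mus=mus, sigma=2.0)
print(theory, observed)
```

With these values the right-hand side of (16) is $4 + 5(2)/3 \approx 7.33$, and the simulated average should fall near it.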
Interpretation of Expected Value of $ms_A$: It is difficult to attach a useful
meaning to $\sum_{j=1}^{a} n_j(\mu_j - \mu)^2/(a - 1)$. It is neither a population variance
nor an estimate of one, nor is it the variance of any actual distribution.
Only if $n_j = n$ is a constant and the particular treatments involved in the
experiment may be regarded as a random sample from a population of
treatments does it have a clear meaning. In this case it is $n$ times an unbiased
estimate of the variance of a population consisting of the means of an infinite
number of treatment populations. However, there are very few experiments
in education and psychology in which it makes sense to regard the particular
experimental treatments as a random sample from any real or hypothetical
population of treatments.

If a single factor (A) underlies the treatment classification, so that the
various “treatments” represent different amounts, or durations, or
intensities, etc., of a single experimental variable, $\sum_{j=1}^{a} n_j(\mu_j - \mu)^2/(a - 1)$ may
possibly be regarded as a “measure” of the “potency” of the experimental
factor. It measures this potency in the sense that the higher the relationship
between the experimental and criterion variables, the larger the differences
$(\mu_j - \mu)$ will be. There is no satisfactory logical basis, however, for
weighting the squared deviations by a variable $n_j$, since $n_j$ is arbitrarily selected.


Even though $n_j = n$ is a constant, the meaning of $n\sum_{j=1}^{a}(\mu_j - \mu)^2/(a - 1)$
depends on another wholly arbitrary choice, namely, the choice of the
particular amounts, or durations, etc., of the experimental variable represented in
the experiment. For example, if the “treatments” represent varying amounts
of practice in performing a certain task, $n\sum_{j=1}^{a}(\mu_j - \mu)^2/(a - 1)$ will obviously
be quite different if the experimental comparisons are among 2, 3, and 4
hours of practice than if they are among 1, 3, and 5 days of practice, or
among 1, 5, and 9 days of practice, etc.


All that one can safely say about $\sum_{j=1}^{a} n_j(\mu_j - \mu)^2/(a - 1)$ is that its
magnitude depends on the differences among the population means for the
particular treatments selected for experimental comparison, noting carefully
the wholly arbitrary manner in which these treatments may have been
defined or selected and the weights ($n_j$) assigned. In this restricted sense,
$\sum_{j=1}^{a} n_j(\mu_j - \mu)^2/(a - 1)$ may be regarded as a measure of the potency of the
“treatments effect,” but even then it is best limited to the case in which
$n_j = n$ is a constant.

The mistake¹ has frequently been made in experimental work of attempting
to use the mean square ratio ($ms_A/ms_w$) as a measure of the relative potency
of the experimental factor, that is, of its potency in comparison with that of
the factors which give rise to differences among subjects within the same
treatment groups. That this is not a valid interpretation of F is evident
from (16) which, since $E(ms_w) = \sigma^2$, may in the case in which $n_j = n$ be
written

$$E(ms_A) = E(ms_w) + n\sum_{j=1}^{a}(\mu_j - \mu)^2/(a - 1). \tag{17}$$

Accordingly, if in a particular experiment $ms_A$ and $ms_w$ happened to be equal
to their expected values, the mean square ratio would be

$$\frac{ms_A}{ms_w} = \frac{ms_w + \dfrac{\sum_{j=1}^{a} n_j(\mu_j - \mu)^2}{a - 1}}{ms_w}. \tag{18}$$


From this it is apparent that F depends not only on the magnitude of the
differences among the treatment population means, but also on the
variability of the experimental material (as measured by $ms_w$), on the sizes of the
experimental groups ($n_j$), and on the number of treatments ($a$). In other
words, F depends not only on the real differences among the particular
treatments involved, but on the precision of the experiment and the number
of treatments as well. A high F does not necessarily mean that the
treatments differ greatly, nor does a low F necessarily mean that they are much
alike.

¹ In particular, it has been argued (see Peters and Van Voorhis, Statistical Procedures
and Their Mathematical Bases, McGraw-Hill Book Company, Inc., 1940, pp. 324-325)
that ε, the unbiased correlation ratio (which is a function of F and the numbers of cases
and treatments) is superior to F as a basis for testing the null hypotheses, since ε may
be interpreted as a measure of the “strength of the relationship” between the
experimental and the criterion variables. This argument overlooks entirely the fact that in
most applications of analysis of variance to experimental designs, the value of either
F or ε depends upon the arbitrary choice of categories in the treatment classifications,
and hence is not meaningful as an index of strength of relationship.
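The dependence of the ratio in (18) on group size and on within-group variability can be made concrete with a small calculation. The sketch below sets $ms_w$ equal to its expected value $\sigma^2$ and assumes equal group sizes; the population means, the values of $n$ and $\sigma$, and the function name are all hypothetical.

```python
def expected_value_ratio(mus, n, sigma):
    """Ratio in (18) with ms_w set equal to sigma^2 and equal group sizes n."""
    a = len(mus)
    mu = sum(mus) / a
    treatment_term = n * sum((m - mu) ** 2 for m in mus) / (a - 1)
    return (sigma ** 2 + treatment_term) / sigma ** 2


mus = [10.0, 11.0, 12.0]       # the same hypothetical population means throughout

small_groups = expected_value_ratio(mus, n=5, sigma=4.0)
large_groups = expected_value_ratio(mus, n=50, sigma=4.0)
less_variable = expected_value_ratio(mus, n=5, sigma=1.0)

print(small_groups, large_groups, less_variable)
```

The three ratios differ widely (about 1.31, 4.13 and 6.00) although the differences among the treatment population means are identical in all three cases.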


In spite of the difficulty of interpreting $\sum_{j=1}^{a} n_j(\mu_j - \mu)^2/(a - 1)$, we can
draw some very useful inferences from (16). We may note first that
if the null hypothesis is true, that is, if $\mu_1 = \mu_2 = \cdots = \mu_a = \mu$, then
$\sum_{j=1}^{a} n_j(\mu_j - \mu)^2/(a - 1)$ equals zero, and the expected value of $ms_A$ is $\sigma^2$. This
is the same as the expected value of $ms_w$ under the null hypothesis.
Accordingly, if the null hypothesis is true and both $ms_A$ and $ms_w$ have their expected
values, the ratio between them is 1.00. However, if the null hypothesis is
false and both $ms_A$ and $ms_w$ have their expected values, the ratio between
them will be greater than 1.00. This confirms our choice of a measure of
discrepancy, and has further implications which will be made clear in the
following section.

The F-ratio as a Ratio of the Observed Variance of the Treatment Means to
Their Expected Chance Variance: It will be useful for purposes of subsequent
discussions to observe that the F-ratio of (5) may be regarded as the ratio
of the observed variance of the $a$ treatment means to the variance that
they would be expected to have as a result only of chance fluctuations in
simple random sampling. Let us note first that when $n_j = n$ is constant,
our best estimate of the variance ($\sigma^2$) of the population from which the
treatment groups are drawn is that denoted as est'd $\sigma^2$ on page 51. The variance
in the means of an infinite number of random samples of $n$ cases each drawn
from a population whose variance is $\sigma^2$ is $\sigma_M^2 = \sigma^2/n$. Hence the best
estimate of $\sigma_M^2$ that we can secure from the treatment groups is

$$\text{est'd } \sigma_M^2 = \frac{\text{est'd } \sigma^2}{n} = \frac{1}{n}\cdot\frac{\sum_{j=1}^{a}\sum^{n}(X - M_j)^2}{a(n - 1)}.$$

We know that if a random sample of $n$ cases is drawn from a population whose
variance is $\sigma^2$, the average or expected variance of the sample is
$\frac{n - 1}{n}\sigma^2$. Accordingly, the expected chance variance of a sample of $a$ means
drawn from a population of means whose variance is $\sigma_M^2$ would be

$$\frac{a - 1}{a}\,\text{est'd } \sigma_M^2 = \frac{a - 1}{a}\cdot\frac{1}{n}\cdot\frac{\sum_{j=1}^{a}\sum^{n}(X - M_j)^2}{a(n - 1)}.$$

With these facts in mind, let us examine the F-ratio (5) when both numerator
and denominator are multiplied by $(a - 1)/an$. The result is

$$F = \frac{\dfrac{\sum_{j=1}^{a}(M_j - M)^2}{a}}{\dfrac{a - 1}{a}\cdot\dfrac{1}{n}\cdot\dfrac{\sum_{j=1}^{a}\sum^{n}(X - M_j)^2}{a(n - 1)}} = \frac{\text{observed variance of treatment means}}{\text{expected chance variance of these means}}.$$
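The equivalence of the two forms of the ratio can be verified on any set of scores with equal group sizes. The following sketch uses hypothetical scores (not data from the text) and computes F both as a mean square ratio and as the observed variance of the treatment means divided by their expected chance variance.

```python
# Hypothetical scores with equal group sizes (n = 4 in each of a = 3 groups).
groups = [[3, 5, 4, 6], [7, 9, 8, 8], [2, 1, 3, 2]]
a = len(groups)
n = len(groups[0])

means = [sum(g) / n for g in groups]
M = sum(means) / a             # general mean (equal n, so an unweighted mean)
ss_w = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

# F as the mean square ratio.
ms_A = n * sum((m - M) ** 2 for m in means) / (a - 1)
ms_w = ss_w / (a * (n - 1))
F_mean_squares = ms_A / ms_w

# F as observed variance of the means over their expected chance variance.
observed_variance = sum((m - M) ** 2 for m in means) / a
estimated_variance_of_mean = ms_w / n                    # est'd sigma_M^2
chance_variance = (a - 1) / a * estimated_variance_of_mean
F_variances = observed_variance / chance_variance

print(F_mean_squares, F_variances)
```

Both expressions give the same value, here $109/3 \approx 36.33$.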




Defining the Region of Rejection: We are now ready to go on with the re- 
maining steps (page 49) in testing the null hypothesis. We shall postpone 
temporarily any consideration of Step 3, that of deciding what risk to take 
of rejecting a true hypothesis, and go on first to Step 4, that of marking off 
the region of rejection along the scale of possible values of the measure of 
discrepancy. We are interested, of course, in the possibility that the treat- 
ments really differ in their effectiveness, and that this has caused the observed 
variance of the means to be larger than it would be as the result of chance 
alone. In other words, we are interested only in the possibility that the 
ratio of observed to chance variance of treatment means is too large to be 
reasonably attributed to chance. If the obtained ratio turns out to be less 
than 1.00, we have no basis for rejecting the hypothesis. If the assumptions 
underlying the test are true, an F of less than 1.00 could be due only to chance. 
The only other possible explanation is that the assumption of random sam- 
pling is false in that the treatment groups are more alike than random samples 
would be. It is apparent, then, that the region of rejection should be en- 
tirely under the right tail of the distribution. That is, the test will be of 
the type known as a “one-tailed” test, as contrasted with one in which the 
region of rejection consists of two parts, one under the right tail and one 
under the left tail of the sampling distribution. 

Suppose now that we have defined the region of rejection as that lying to 
the right of the 1% point in the F-distribution, and that our measure of dis- 
crepancy (the mean square ratio) has been found to lie in this region. If 
the basic assumptions have been satisfied (pages 51-52), there are only two
possible explanations for this event. One is that the null hypothesis is true, 
and that the ratio has fallen in the region of rejection only as a result of 
chance fluctuations in random sampling. We know that under a true null 
hypothesis this would happen only very rarely (one percent of the time). 
To retain the null hypothesis, then, we must contend that a rare event has 
actually “come off” in this instance. The other explanation is that the null 
hypothesis is false, or that the treatments really differ in effectiveness, and 
that the large ratio is due primarily to these real differences rather than 
to chance. If we choose this latter interpretation, we need not contend that 
a rare event has actually occurred, and hence, we prefer the second inter- 
pretation. We know, however, that under a true null hypothesis the mean 
square ratio will fall in this region of rejection one time out of every one 
hundred in the long run. In choosing the second interpretation, then, we run 
the risk that this is one of those times, and that we are rejecting a true null 
hypothesis. How large this risk is depends, in this one-tailed test, upon the 
“percent point” in the F-distribution at which the region of rejection begins.
If the region of rejection begins at the 5% point, we will in the long run 
reject an hypothesis, when it is true, five percent of the times it is tested. 
That is, we will be taking a 5% risk of rejecting a true null hypothesis, or 
will be making the test at the 5% level of significance. Generally, we wish to 
keep this risk small. It is common practice, therefore, to make the region
of rejection begin at the 5%, the 1%, or sometimes even at the 0.1% point
in the F-distribution. In other words, it is customary to make the test at 
a "high" level of significance — a high level being one corresponding to a 
small risk, and vice versa. 

The F-table (Table 3, pages 41-44) has been constructed so that we may 
employ any one of seven convenient¹ levels of significance, corresponding
to the 20%, 10%, 5%, 2.5%, 1%, 0.5% and 0.1% points in the F-dis- 
tribution. For convenience in the discussion, we will refer to the corre- 
sponding values of F as the “20% value of F,” the “10% value of F,” etc. 

An Example: Suppose that in an actual experiment, we have decided in 
advance to make the test of significance at the 5% level, or to reject the 
hypothesis if the mean square ratio exceeds the 5% value of F. For reasons 
to be considered later, this decision should always be made before examining 
the data. Suppose that the ratio turns out to be 13.27, and that the degrees 
of freedom for the ratio are 3 and 5. Turning to the table for F, we find that 
the 5% value for 3 and 5 degrees of freedom is 5.41. Accordingly, we reject
the null hypothesis. We note that in this case we could have rejected the
hypothesis had we made the test at the 1% level. However, we should not
change the level of significance at which the test is to be made after seeing
the experimental results. This would be much like changing the betting odds
on a horse race after the race is over. 
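The statement that a true null hypothesis is rejected five percent of the time at the 5% level can be illustrated by simulation. The sketch below uses the 5% value of 5.41 for 3 and 5 degrees of freedom quoted in the example. The rest is assumed for illustration: normal populations with equal means, group sizes of 3, 2, 2 and 2 (which give those degrees of freedom), 40,000 simulated experiments, and the function names.

```python
import random


def f_ratio(groups):
    """Mean square ratio ms_A / ms_w for a simple-randomized design."""
    N = sum(len(g) for g in groups)
    a = len(groups)
    M = sum(sum(g) for g in groups) / N
    means = [sum(g) / len(g) for g in groups]
    ss_A = sum(len(g) * (m - M) ** 2 for g, m in zip(groups, means))
    ss_w = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_A / (a - 1)) / (ss_w / (N - a))


def rejection_rate(k, sizes, critical_value, seed=0):
    """Share of k experiments under a true null whose F exceeds the critical value."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(k):
        groups = [[rng.gauss(0.0, 1.0) for _ in range(n)] for n in sizes]
        if f_ratio(groups) > critical_value:
            rejections += 1
    return rejections / k


# Sizes 3, 2, 2, 2 give a = 4 and N = 9, hence 3 and 5 degrees of freedom.
rate = rejection_rate(k=40000, sizes=[3, 2, 2, 2], critical_value=5.41)
print(rate)
```

The printed proportion should be close to 0.05, which is the risk of a Type I error fixed by this choice of region of rejection.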


Type I and Type II Errors


It does not follow from the preceding discussion that it is always desirable
to set a very high level of significance for the test of the null hypothesis. 
We must keep in mind that there are always two kinds of errors possible in 
testing any statistical hypothesis. One is the error (sometimes called a 
Type I error) of rejecting an hypothesis when it is true. The other is that 
(Type II) of retaining an hypothesis when it is false. The risk of making 
a Type I error is exactly determined when we establish our region of re- 
jection (assuming, of course, that all the basic assumptions are exactly satis- 
fied). The relative frequency with which a false hypothesis would be re- 
tained under the same circumstances, however, cannot be determined in 
practice. This frequency depends on how far the hypothesis departs from 
the truth. Suppose, for example, that the (null) hypothesis is almost true, 
or that the treatments are very nearly the same in effectiveness, and the 
treatment population means differ only very slightly. If a large number of 
similar experiments were independently performed with these treatments, 
the mean square ratios obtained would be only slightly larger than if the 
null hypothesis had been true. Let us remember again that the mean square
ratio is the same as the ratio of the observed variance of the treatment means
to their chance variance. The effect of the small real differences among
treatments would be to add a small and constant amount $\left[\sum_{j=1}^{a} n_j(\mu_j - \mu)^2/(a - 1)\right]$
to the numerator of each ratio [see (18), page 63], and this would affect the
value of each ratio only slightly. Under these circumstances, the actual
sampling distribution of the ratios might be that represented by Curve A
in Figure 3, while Curve F might represent the sampling distribution that
would have been obtained had the null hypothesis been true. The diagonally
ruled area represents 5% of the area in the F-distribution; the horizontally
ruled area in the A-distribution represents the proportion of the times that
the false null hypothesis is retained when this region of rejection is employed.
In the situation represented by Curve A, the false null hypothesis would
be retained almost 95% of the times it is tested.

[Figure 3: Curves F, A, and B over the scale of mean square ratios, with the
region of acceptance to the left and the region of rejection to the right of
the 5% point in the F-distribution.]

Figure 3. Relation of possible actual distributions (A and B)
of mean square ratios to the distribution (F) that would
have been obtained had the null hypothesis been true

¹ There are ways of interpolating between these values (see C. J. Burke, “Computation
of the Level of Significance in the F-test,” Psychological Bulletin, vol. 48 (September,
1951), pp. 392-397); but there is hardly ever any practical need to employ them.

Suppose, on the other hand, that the null hypothesis is far from true, or 
that the treatments differ markedly in effectiveness and that the population 
means for these treatments differ widely. If a series of similar experiments 
were performed with these treatments, large ratios of the observed to the 
expected variances of treatment means would nearly always be obtained.
This is because the effect of adding a large constant $\left[\sum_{j=1}^{a} n_j(\mu_j - \mu)^2/(a - 1)\right]$
to the numerator of each ratio would be to increase the value of each ratio
markedly, particularly of those that would otherwise be less than or only 
slightly larger than 1.00. In this case, the actual sampling distribution of 
these ratios might be very crudely represented by Curve B in Figure 3. 
The vertically ruled area in this distribution indicates the relative frequency 
with which the false null hypothesis would be accepted in this case. We see 
that the false null hypothesis would only rarely be retained, or that the risk 
of a Type II error would be quite small. 

It should now be clear why it is not always desirable to set a very high
level for the test of significance. From the preceding illustration, it is evident
that, other things being equal, the higher the level of significance of the
test, the greater is the danger of retaining a false hypothesis. If we try to
reduce the risk of a Type I error, we increase the risk of a Type II error.
Since the consequences of either type of error are unfortunate, we must set
the level of significance of our test with both types of error in mind.
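The two situations described above (Curves A and B), and the trade-off between the two types of error, can be illustrated by simulation. The sketch below rests on assumptions made only for illustration: normal populations with unit variance, group sizes of 3, 2, 2 and 2 (3 and 5 degrees of freedom), hypothetical sets of population means, and 20,000 experiments per case. The 5% value of 5.41 is the one quoted earlier in the text; the 1% value of 12.06 is taken from standard F tables and is assumed to agree with Table 3.

```python
import random


def f_ratio(groups):
    """Mean square ratio ms_A / ms_w for a simple-randomized design."""
    N = sum(len(g) for g in groups)
    a = len(groups)
    M = sum(sum(g) for g in groups) / N
    means = [sum(g) / len(g) for g in groups]
    ss_A = sum(len(g) * (m - M) ** 2 for g, m in zip(groups, means))
    ss_w = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_A / (a - 1)) / (ss_w / (N - a))


def retention_rate(k, sizes, mus, critical_value, seed=0):
    """Share of k experiments in which the (false) null hypothesis is retained."""
    rng = random.Random(seed)
    retained = 0
    for _ in range(k):
        groups = [[rng.gauss(mu, 1.0) for _ in range(n)]
                  for n, mu in zip(sizes, mus)]
        if f_ratio(groups) <= critical_value:
            retained += 1
    return retained / k


sizes = [3, 2, 2, 2]                        # 3 and 5 degrees of freedom
nearly_equal = [0.0, 0.1, -0.1, 0.0]        # hypothesis almost true (Curve A)
widely_different = [0.0, 5.0, 10.0, 15.0]   # hypothesis far from true (Curve B)
moderate = [0.0, 1.0, 2.0, 3.0]

beta_small_diff = retention_rate(20000, sizes, nearly_equal, 5.41)
beta_large_diff = retention_rate(20000, sizes, widely_different, 5.41)
beta_at_5_percent = retention_rate(20000, sizes, moderate, 5.41)
beta_at_1_percent = retention_rate(20000, sizes, moderate, 12.06)

print(beta_small_diff, beta_large_diff)
print(beta_at_5_percent, beta_at_1_percent)
```

With nearly equal means the false hypothesis is retained in almost 95% of the experiments, and with widely different means almost never. For the same moderate differences, moving from the 5% to the 1% level raises the proportion of Type II errors.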

The Consequences of Type I and Type II Errors: The consequences of a
Type II error are quite different in nature from those of a Type I error. 
To weigh the relative seriousness of these consequences, it will be useful to 
regard psychological and educational experiments as classifiable into two 
broad categories. The first category includes experiments designed primarily 
to determine whether or not a certain criterion variable (X) depends on, or is 
related to, a certain experimental variable (Y). Such an experiment may 
be termed an “exploratory” experiment, whose purpose is to determine if Y 
is or is not “a factor" of X. If a significant result is not obtained in an ex- 
ploratory experiment, the conclusion is tentatively drawn that Y is not a 
factor, and the experimenter's attention is usually turned to other possible 
causes of, or factors of X. If a significant result is obtained in the exploratory 
experiment and it is concluded that Y is a factor of X, the experiment may be 
followed by an experiment (or series of experiments) belonging to the second 
category. Experiments in this category are designed to determine the nature 
of the functional relationship between Y and X — preferably to describe this 
relationship in algebraic form. Exploratory experiments, then, are those 
that, when significant results are obtained, lead to further experiments with 
the same factor or factors. 

If a Type I error is made in the exploratory experiment, that is, if a
“significant” result leads to a false conclusion that Y is a factor of X, the likely
consequence is that time and effort will be wasted on further experiments 
designed to determine the nature of the relationship between Y and X. To 
minimize the danger of thus following a false lead, we usually set a high level 
of significance for tests made in exploratory experiments. 

If we make a Type II error in the exploratory experiment, that is, if the 
null hypothesis is false but we fail to get a significant result and therefore 
falsely conclude that Y is not a factor, the likely consequence is simply that 
we will fail to follow up a true lead. In a sense this is not as serious as to 
have wasted time following up a false lead, since in the meantime we may 
be trying out other possible leads, all of which might eventually have to be 
tried out anyway. Furthermore, it will be generally understood that we 
have not proved that Y is not a factor, so that anyone else who has his own 
reasons to believe that Y is a factor is at liberty to plan experiments to prove 
his contentions. 

The preceding is, of course, an oversimplification of the situation. In
practice we are frequently not so much concerned with whether Y is or is 
not a factor, categorically, as with whether or not it is a relatively important 
factor. Having performed exploratory experiments with a number of possible 
factors, all of which may be real but not equally important, we would like to
give priority in subsequent experimentation to the factors which are most
important. If we always set a high level of significance for our tests at the 
exploratory level, we may be quite sure that we will not follow many com- 
pletely false leads, and at the same time, we will have some assurance that 
the true leads which we ignore (because of Type II errors) are probably 
among the less promising ones. 

The distinction between “exploratory” experiments and others is an arbi- 
trary one. Any experiment may be termed an exploratory experiment 
if its results provide the basis for further experimentation which may be 
fruitless if those results are not sound. Accordingly, what has been said 
about the consequences of the two types of error in exploratory experiments 
is really applicable to all experiments. 

It is perhaps on the basis of reasoning somewhat like the preceding that 
experimenters have usually been much more concerned about the consequences 
of Type I than of Type II errors, and have typically set a rather high level of 
significance for their tests of treatment effects. However, the analysis just 
presented is still considerably oversimplified. There are many situations in
which the consequences of a Type II error are clearly more serious than those 
of a Type I error. For example, suppose that a given public school system has 
been employing a certain method (Method A) of teaching elementary school 
spelling, and that it has been suggested that another method (Method B) be 
substituted for it. Suppose also, that the change could be made at relatively 
little cost and inconvenience, and that the continuing instructional costs under 
Method B would be about the same as those under Method A. The super- 
intendent of schools decides to base his decision on the outcome of an experi- 
mental comparison of the two methods. He decides that he will introduce 
Method B only if a statistically significant difference is found in favor of 
that method, but reasons that he need not set a high level of significance 
for the test, and decides on the 20% level. Since the consequences of a 
Type II error are, in this case, considerably more serious than those of a 
Type I error, he might easily have justified a still lower level. Suppose, on 
the one hand, that Method B is really superior to Method A but that, never- 
theless, no significant difference is found in the experiment. A Type II error 
is thus made, with the consequence that the school retains Method A and 
continues indefinitely to secure poorer results in spelling than it might other- 
wise have secured. On the other hand, suppose that A and B are really equally 
good, but that a significant result in favor of B is nevertheless obtained in 
the experiment. A Type I error is thus made, with the consequence that a 
needless but slight expenditure is made to substitute one equally good method 
for the other. Clearly, the consequences of the Type II error are the more 
serious in this case. Had the cost of a changeover and the cost of con- 
tinued operation under B been very much higher, the consequences of a 
Type I error would have been relatively much more undesirable, and 
a much higher level of significance might then have been set for the test of 
significance. 


It should now be clear that it is dangerous to attempt any generalization 
about the relative seriousness of Type I and Type II errors. The experimenter 
should always give careful consideration to the consequences of errors of both 
types, and should set the level of significance of his test accordingly. In 
other words, the whole problem should be thought through independently 
in each new situation. 

Effect of the Precision of the Experiment on the Risk of Type II Errors: 
Unfortunately, we never know in practice to what extent, if any, the null 
hypothesis is false; therefore, we never know what risk we are running of 
making a Type II error. We do know, however, that this risk is frequently 
very large. 

How large this risk is (for any given relationship among the true treat- 
ment means) depends partly on the precision of the experiment. Suppose 
again that there are certain real differences among a number of treatments, 
and that a series of similar experiments to test the null hypothesis are per- 
formed with these treatments. Suppose, on the one hand, that very large 
treatment groups have been employed, so that the precision of the experiment 
is relatively high. 

We have noted earlier that when both numerator and denominator of the 
F-ratio have their expected values, the numerator is $ms_w$ plus a constant, 
and the denominator is $ms_w$, the constant being 

$$\frac{\sum_j n_j(\mu_j - \mu)^2}{a - 1}.$$

The value of this constant depends not only upon the magnitude of the 
differences among treatment means $(\mu_j - \mu)$, but also upon the sizes $(n_j)$ 
of the treatment groups. Thus we see that the effect of real treatment differences 
when $n_j$ is large is to add a relatively large constant to the numerator, 
but the effect of the same treatment differences when $n_j$ is small is to add 
a relatively small constant, so that the F-ratio tends to be large when $n_j$ 
is large and small when $n_j$ is small. This means, in terms of Figure 3 and 
the accompanying discussion on page 67, that when the null hypothesis is 
false, the actual sampling distribution (B) of the obtained ratios will be 
farther to the left when $n_j$ is small than when it is large, and that this 
distribution will also be more variable when $n_j$ is small. The result is that with 
a small $n_j$ there will be a much greater overlap in the F and B distributions, 
or that the risk of a Type II error will be greater when $n_j$ is small than when 
it is large. In other words, the lower the precision of the experiment, the 
greater is the risk of retaining a seriously false null hypothesis. 

Statisticians sometimes refer to the "power" of a test. By this they 
mean the probability of rejecting a false hypothesis, that is, (1 — p), p being 
the probability of a Type II error. The power of a test, then, is dependent, 
among other things, on the precision of the experiment. 
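
The dependence of power on the size of the treatment groups may be illustrated by a small sampling experiment. The following is only a sketch under stated assumptions: the true treatment means (0, 3, 6), the within-groups standard deviation (10), the group sizes (5 and 40), and the function names are all hypothetical, and the 5% critical value is estimated empirically from simulated experiments in which the null hypothesis is true rather than taken from an F-table.

```python
import random
import statistics


def f_ratio(groups):
    """Ratio of treatments to within-groups mean squares (equal group sizes)."""
    a = len(groups)
    n = len(groups[0])
    group_means = [statistics.fmean(g) for g in groups]
    grand_mean = statistics.fmean(group_means)  # valid because sizes are equal
    ms_treat = n * sum((m - grand_mean) ** 2 for m in group_means) / (a - 1)
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    return ms_treat / (ss_within / (a * (n - 1)))


def simulate_f(true_means, n, sigma, reps, rng):
    """Obtained F-ratios from `reps` simulated simple-randomized experiments."""
    return [
        f_ratio([[rng.gauss(mu, sigma) for _ in range(n)] for mu in true_means])
        for _ in range(reps)
    ]


def estimated_power(true_means, n, sigma=10.0, reps=2000, seed=1):
    """Proportion of experiments in which a false null hypothesis is rejected."""
    rng = random.Random(seed)
    # Empirical 5% point of the ratio when the null hypothesis is true.
    null_fs = sorted(simulate_f([0.0] * len(true_means), n, sigma, reps, rng))
    critical = null_fs[int(0.95 * reps)]
    alt_fs = simulate_f(true_means, n, sigma, reps, rng)
    return sum(f > critical for f in alt_fs) / reps


means = [0.0, 3.0, 6.0]          # hypothetical true treatment means
power_small = estimated_power(means, n=5)
power_large = estimated_power(means, n=40)
print("estimated power, n = 5: ", power_small)
print("estimated power, n = 40:", power_large)
```

With the same true differences among treatments, the estimated power should be markedly higher for the larger groups, which is the point made in the text.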

Possibilities of Controlling the Risk of Type II Errors: We have seen that in 
testing the treatments effect in a simple-randomized experiment, the risk of 
a Type II error depends (1) on how false the null hypothesis is, (2) on the 
precision of the experiment. The precision of the experiment depends upon 
the variability of the experimental material (measured in this case by the 
error mean square, that is, by the within-groups mean square) and on the 
sizes of the treatment groups. The sizes of the treatment groups are, of 
course, subject to the experimenter's control. Ideally, therefore, it would 
seem that in planning the experiment, the experimenter's aim should be 
to make the experiment just precise enough (or the groups just large enough) 
so that the danger of retaining a "seriously false" hypothesis would not 
exceed a specified risk. More specifically, the ideal procedure would seem to 
be as follows: first, decide how large a departure from the truth we are willing 
to tolerate in the null hypothesis (define exactly what we mean by a “seriously 
false" null hypothesis); second, specify the risks we are willing to take of 
making Type I and Type II errors, respectively; third, secure a dependable 
estimate of the error variance; and fourth, calculate exactly how large our 
samples need be in order that the risk of a Type II error may not exceed the 
specified value. 
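
For the simplest case of two treatments with groups of equal size, the four steps just listed can be sketched as follows. This is the familiar large-sample normal approximation, not necessarily the procedure given in the reference cited below; the function name and the illustrative values of `delta` (the smallest difference regarded as "seriously false") and `sigma` (the advance estimate of the within-groups standard deviation) are assumptions made only for the example.

```python
import math
from statistics import NormalDist


def cases_per_group(delta, sigma, alpha=0.05, beta=0.20):
    """Approximate number of cases per group for comparing two treatment means.

    delta : smallest true difference we are unwilling to overlook (assumed)
    sigma : advance estimate of the within-groups standard deviation (assumed)
    alpha : risk of a Type I error we are willing to take (two-sided test)
    beta  : risk of a Type II error we are willing to take
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(1 - beta)
    # Normal approximation: n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


n_needed = cases_per_group(delta=5.0, sigma=10.0)
n_for_half_delta = cases_per_group(delta=2.5, sigma=10.0)
print(n_needed, n_for_half_delta)
```

Halving the difference we wish to detect roughly quadruples the required number of cases, which shows why the definition of a "seriously false" hypothesis matters so much in planning.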

There are three major reasons why, in psychological research practice, 
this ideal cannot often be attained. One is that we are seldom able to secure 
in advance any useful estimate of the error variance. In many experiments 
the criterion measure to be employed is one with which we have had little 
or no previous experience. Frequently, therefore, we are unable to estimate 
in advance of the experiment what variability our experimental subjects will 
show in relation to this criterion. Furthermore, in practice most of our designs 
are relatively complex designs in which the error variance depends not only 
on Type S errors, but on Type G and Type R errors as well, about which we 
know even less. In most instances, therefore, in order to obtain a useful 
estimate of the error variance, we would have to perform a preliminary ex- 
periment conducted simply in order to provide such an estimate, and this 
is often impracticable. 

A second difficulty is that we are rarely able to attach any fundamental 
or absolute meanings to the scale along which, or to the units in terms of 
which, the criterion measures are expressed. For this reason, we often 
have only an inadequate basis for defining just what we mean by a “seriously 
false” null hypothesis. There are some situations, however, in which, as the 
result of some advance experience with the criterion, a fairly useful empirical 
basis for such a decision may be available. 

The third difficulty is that, for most experimental designs, we lack the 
theoretical basis for making the necessary calculations. This basis has been 
worked out for the simple case of a simple-randomized experiment involving 
only two treatments with treatment groups of the same size,¹ but not for 
the F-tests in complex designs involving several treatments. Students working 
on specific experimental problems who have available an estimate of the 
population variance, and who can meaningfully define a “serious discrepancy” 
between fact and hypothesis, are advised to acquaint themselves with this 
procedure.² 

1 William G. Cochran and Gertrude M. Cox, Experimental Designs (New York: John 
Wiley and Sons, Inc., 1950), pp. 15-26. 

2 Ibid. 

In general, then, we cannot often design an experiment so as to be sure 
that a specified risk of retaining a "seriously false" null hypothesis will 
not be exceeded, even where we can provide a meaningful definition of a 
"seriously false" hypothesis. Even after the experiment has been con- 
ducted and an unbiased estimate of the error variance is available, we usually 
still cannot state what risk of a Type II error is involved in the test of significance, 
either because we cannot meaningfully describe what we mean by a 
"seriously false" hypothesis, or because the theoretical basis for calculating 
the risk has not been worked out. 

However, in interpreting any F-test, we can always recognize the possibility 
of a Type II error, and we can always give some thought to the consequences 
of such errors. We can take these consequences into consideration in de- 
ciding what level of significance we will adopt in the test of the null hypothe- 
sis, knowing that the higher the level of significance, the greater is the risk 
of a Type II error. If we fail to find a significant result and accept the null 
hypothesis, we can always say that the error, if any, in that hypothesis is 
"not sufficiently large to have been revealed as such by our experiment." 
That is, we may always contend that the risk of retaining a “seriously false” 
null hypothesis is negligible if we always define a “seriously false” null 
hypothesis as one in which the error is so large that it will nearly always be 
revealed in an experiment as precise as that which we have conducted. 

From this point of view, what we mean by a “seriously false” null hypothe- 
sis is determined when we decide on what numbers of cases to employ in the 
experiment. Frequently, in planning the experiment we make it as precise 
as our resources will permit, or as precise as we can justify in terms of its 
cost in relation to the importance of the problem under investigation. Our 
position may be that if the error in the null hypothesis is not sufficiently large 
to be revealed by an experiment with this precision, then the error is of 
little practical consequence anyway, and we will continue to use the null 
hypothesis as a working hypothesis. It is extremely important, however, 
to recognize that this is the nature of the decision we are making when we 
decide on the scope of our experiment; and we should do our best, both on 
the basis of our knowledge of underlying theory and of our previous experience 
with similar data, to make as meaningful as possible the degree of falsity we 
will tolerate in the null hypothesis. 


The Importance of the Assumptions Underlying the F-Test 


The General Effect of a Failure to Satisfy an Assumption: It is very important, 
in any application of the simple-randomized design, to consider very carefully 
the assumptions underlying the F-test of the null hypothesis and the effects 
on the validity of this test of the failure to satisfy one or more of these 
assumptions. The ratio of treatments to within-groups mean squares is 
distributed as F if all four of the following conditions are satisfied.¹ 


1) All treatment groups were originally drawn at random from the same 
parent population. 
After administration of the treatments, each group may then be 
regarded as a simple random sample from a different (hypothetical) 
treatment population. 


2) The variance ($\sigma^2$) of the criterion measures is the same for each of these 
treatment populations. 


3) The distribution of criterion measures for each treatment population is 
normal. 


4) The mean of the criterion measures is the same for each treatment 
population (the null hypothesis). 


If any one of these conditions is not satisfied, the sampling distribution of 
mean square ratios may differ from the F-distribution. Generally, if one 
or more of the conditions is not satisfied, the distribution of $ms_A/ms_w$ will 
be more variable than the F-distribution. This means that if a “significant” 
mean square ratio is obtained in an experiment, it could have resulted from a 
failure to satisfy any one of these conditions. Therefore, before concluding 
from a significant F that it is Condition 4 (our hypothesis) which is not 
satisfied, we must assure ourselves that a failure to satisfy any of the other 
conditions is not likely to have any consequential effect on the sampling 
distribution of $ms_A/ms_w$. 

The Assumption of Random Sampling: It is very seldom that an experi- 
menter can draw his subjects strictly at random from the real population in 
which he is basically interested. Usually, he must be content to work with 
those members of that population who are readily accessible to him, even 
though the accessible members of the population may differ systematically 
from those who are not accessible. A research worker in psychology, for 
example, might wish to work with random samples from a population con- 
sisting of “all adult American males," but may have to be content with a 
sample consisting of male students in a sophomore course in general psy- 
chology in a particular college or university. A research worker in education 
may wish to conduct an experiment from which he can fairly draw inferences 
about “all fourth-grade pupils in American public schools," but he may 
have to conduct his experiment in a single school in which the principal 
or superintendent is known to him and is willing to let him have the necessary 
facilities for his experiment. 


1 See footnote on page 51. 


Very frequently, however, the experimenter can draw his experimental sub- 
jects strictly at random from those subjects that are accessible to him. If 
not, he can nearly always at least randomize his experimental subjects with 
reference to the treatments. That is, by use of a table of random numbers 
he can leave it strictly to chance which subjects are to constitute each treat- 
ment group. Having done this, he may then fairly contend that his experi- 
mental groups are all random samples from the same hypothetical parent 
population — a population which may be roughly defined as consisting of all 
individuals "like those involved in the experiment." In the case of the 
psychological experiment earlier referred to, this might be “all male students 
who have taken or might take a course in general psychology in College X”; 
or in the case of the educational experiment, it might be “all pupils who have 
been or might become fourth graders in School Y” — in each case assuming 
stable general conditions. 
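
The randomization of subjects with reference to treatments, done in the text with a table of random numbers, may be sketched with a pseudo-random number generator instead. The subject labels, treatment names, seed, and function name below are hypothetical and serve only to illustrate the procedure.

```python
import random


def assign_at_random(subjects, treatments, seed=None):
    """Leave strictly to chance which subjects constitute each treatment group."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)                      # random order of all subjects
    groups = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):     # deal subjects out in rotation
        groups[treatments[i % len(treatments)]].append(subject)
    return groups


subjects = [f"S{i:02d}" for i in range(1, 31)]   # hypothetical accessible subjects
groups = assign_at_random(subjects, ["A", "B", "C"], seed=1953)
for treatment, members in groups.items():
    print(treatment, members)
```

Because the order of subjects is shuffled before they are dealt out, every subject has the same chance of falling in any treatment group, and the groups are of equal size whenever the number of subjects is a multiple of the number of treatments.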

The device just suggested, of assigning the experimental subjects at random 
to the treatment groups and of defining a hypothetical parent population to 
fit the subjects actually used, must very frequently be employed in experi- 
mental work in order to make possible any statistical test of an exact hypothe- 
sis. Having employed this device, of course, the experimenter should there- 
after restrict his statistical inferences to this hypothetical parent population. 
If he wishes to extend these inferences to any real population, he must do so 
on a “judgmental” rather than on a statistical basis; that is, he must do so 
without benefit of the safeguards provided by the logic of statistical inference. 
The extent to which he may thus extrapolate his inferences to a real popula- 
tion depends upon his own judgment of the extent to which the relative effects 
of the treatments are the same for the real as for the hypothetical parent 
population. (He need not assume that the absolute effects of each treatment 
are the same for both these populations, but only that the relative effects 
of the treatments are the same.) The average sophomore student in general 
psychology, for example, might make a higher criterion score than the average 
“adult American male” in general, but one might still plausibly contend 
that whatever treatment is most effective for college sophomores is also 
most effective for adult American males in general. However this may be, 
if the experimenter randomizes his experimental subjects with reference 
to treatments in a simple-randomized design, he may, so far as the hypo- 
thetical parent population is concerned, regard Condition 1 as completely 
satisfied. 

The repeated use of “hypothetical” with different specific references in 
this discussion may tend to be confusing. The student should guard particu- 
larly against confusing the hypothetical parent population with the hypotheti- 
cal treatment populations. 

A mistake that has very frequently been made in educational research is 
to regard as a simple random sample of pupils a group consisting of several 
intact school classes, in situations in which the classes differ systematically 
with reference to the criterion variable. For example, in an experiment designed 
to compare two methods of teaching a school subject, the experimenter 
may arrange to have Method A used with, say, seven classes in this subject 
in as many different schools, and Method B used with nine classes in another 
set of schools. In another experiment, Method A and Method B may both be 
used in the same schools, Method A being given to one class and Method B 
to another class in the same subject in the same school, or Method A being 
given to a random half of a class and Method B to the other half in each 
school. Experiments involving more than two methods have frequently been 
similarly designed. To test the significance of the differences among the 
“treatment” groups, the t-test for independent random samples or the F-test 
of the simple-randomized design has been used, regarding the treatment 
groups of combined classes as simple random samples of pupils. This prac- 
tice is legitimate only if there are no systematic differences among classes so 
far as the criterion variable is concerned. With the criteria usually employed 
in these experiments (achievement in school subjects), this is very rarely the 
case. Differences among teachers, among communities, among school facili- 
ties (school plant, instructional equipment, libraries, etc.) and other factors 
typically cause the schools to differ markedly in achievement. Sometimes 
the differences in mean achievement from school to school are of almost the 
same magnitude as the differences in achievement among individual pupils 
in a single school. In such experiments, the treatment groups might be re- 
garded as random samples of schools or of school classes, but not as random 
samples of pupils. Appropriate methods of analyzing the results of such ex- 
periments and other designs more appropriate in such situations will be 
considered in later chapters, particularly Chapters 7 and 8. 
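
The seriousness of this mistake can be illustrated by a sampling experiment in which there is no true difference between two methods, but in which intact classes differ systematically. The sketch below rests on assumed values throughout (6 classes of 20 pupils per method, a pupil standard deviation of 10, a class-to-class standard deviation of 5); the 5% critical value is estimated from simulated experiments without class differences rather than taken from a table.

```python
import random
import statistics


def f_ratio(groups):
    """Pupil-level F-ratio, treating every pupil as an independent case."""
    a = len(groups)
    n = len(groups[0])
    means = [statistics.fmean(g) for g in groups]
    grand = statistics.fmean(means)            # equal group sizes assumed
    ms_treat = n * sum((m - grand) ** 2 for m in means) / (a - 1)
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return ms_treat / (ss_within / (a * (n - 1)))


def one_experiment(rng, class_sd, classes=6, pupils=20, pupil_sd=10.0):
    """Two methods with NO true difference, each given to several intact classes."""
    groups = []
    for _method in range(2):
        scores = []
        for _class in range(classes):
            class_shift = rng.gauss(0.0, class_sd)   # systematic class difference
            scores.extend(class_shift + rng.gauss(0.0, pupil_sd)
                          for _ in range(pupils))
        groups.append(scores)
    return f_ratio(groups)


def rejection_rate(rng, class_sd, critical, reps=1000):
    """Proportion of experiments declared significant although the null is true."""
    return sum(one_experiment(rng, class_sd) > critical for _ in range(reps)) / reps


rng = random.Random(7)
null_fs = sorted(one_experiment(rng, class_sd=0.0) for _ in range(1000))
critical = null_fs[949]                        # empirical 5% point, no class effects
rate_homogeneous = rejection_rate(rng, 0.0, critical)
rate_intact = rejection_rate(rng, 5.0, critical)
print("Type I error rate, no class differences:  ", rate_homogeneous)
print("Type I error rate, classes differ (sd 5): ", rate_intact)
```

Under these assumptions the rate of false "significant" results should stay near 5% when classes do not differ, and rise far above it when they do, even though the methods are equally good in both cases.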

To justify this practice of combining a number of intact school classes and 
regarding them as simple random samples in statistical tests, experimenters 
have sometimes applied preliminary tests to the means and variances of the 
classes (such as the F-test of analysis of variance and the Bartlett test of 
homogeneity of variance). If these tests have failed to reveal significant 
differences among the classes, the experimenters have contended that the 
combined classes might legitimately be regarded as simple random samples. 
The weakness of this logic is that the precision of these tests is usually low, 
due to the small sizes of the classes, and the danger of accepting a false null 
hypothesis is therefore large. (See page 70.) The fact that a statistical 
test has a non-significant outcome does not prove the hypothesis tested, but 
only demonstrates that the observed results could have arisen by chance if 
the hypothesis were true. In every case, there are many other hypotheses, of 
course, under which the same results might also arise. Sometimes (although 
perhaps rarely) the criterion variable employed may be such that conse- 
quential systematic differences among schools are unlikely, and the combined 
classes may legitimately be regarded as simple random samples, granting that 
no significant differences can be found among them. In general, however, 
school classes should thus be combined and regarded as simple random 
samples of pupils only if the assumptions of homogeneity of the means and 
variances of the classes are strongly supported by a priori considerations as 
well as by the outcomes of statistical tests. 

The Assumption of Homogeneity of Variance: Before considering the effect 
upon the validity of the F-test of the failure to satisfy the assumption of 
homogeneity of variance, it will be well to consider first how it is that heter- 
ogeneity of variance arises in educational and psychological experiments. 

Suppose that, just before administering the treatments in a simple-random- 
ized experiment, observations are made of the criterion variable for all of 
the experimental subjects. That is, suppose that an initial as well as a final 
criterion measure is obtained for each subject. We could then define the 
“effect” of a given treatment on a given subject as the difference between 
his final and his initial measure. (Strictly, this is the effect of the treatment 
plus the effect of any extraneous factors which may be associated with the 
treatment in the experiment.) 

It is generally likely that this effect will vary from subject to subject 
for the same treatment, but it is often possible that the variance of these 
effects is the same for all treatments, and it is also often possible that these 
effects are uncorrelated with the initial measures. The variance of the 
final criterion measures for each treatment group would then be equal to 
the variance of the initial measures plus the constant variance of the treat- 
ment effects. In this case, since the variance of the initial measures is, except 
for chance, the same for all treatments, it follows that the assumption of 
homogeneity of variance of the final criterion measures would be satisfied. 

A more likely possibility is that the variance of the treatment effects differs 
from treatment to treatment, and also that the effects are correlated with the 
initial measures, but differently for different treatments. For example, in 
an experimental comparison of several methods of teaching a given school 
subject, certain methods might be more effective with bright than with dull 
students, and others may tend to be equally effective for students at all 
levels of intelligence. If the criterion measure is the score on an achieve- 
ment test, it is likely that for some treatments these scores will be substan- 
tially correlated with intelligence, from which it follows that the treatment 
effects will be correlated with the initial measures. We know that the variance 
of the sum of two related variables is given by 

$$\sigma^2_{1+2} = \sigma_1^2 + \sigma_2^2 + 2r_{12}\sigma_1\sigma_2.$$

Other things being equal, the variance of the final criterion measures will 
then be larger for the treatments whose effects are correlated with the initial 
measures than for those for which they are not. In cases of this kind, then, 
the assumption of homogeneity of variance would not be exactly satisfied. 
We may note, however, that the variance of the final criterion measures 
might be of very much the same magnitude for all treatments even though 
the variance of the treatment effects for individual subjects differs considerably 
from treatment to treatment, and/or even though the correlation of the treat- 
ment effects with the initial measures also differs considerably from treat- 
ment to treatment. This would be true if, for all treatments, the variance of 
the treatment effects were small in relation to the variance of the initial 
measures. Suppose, for example, that in an experiment involving two treatments, 
the initial variances were each 20 and the variances of the treatment 
effects were 2 and 5, respectively, and that these effects were uncorrelated 
with the initial measures in both cases. The variance of the final criterion 
measures would then be 22 for one treatment and 25 for another. This differ- 
ence amounts to only about 10% of either variance, even though the variance 
of the treatment effects is more than twice as large for one treatment as for 
another. The difference would still be small if the treatment effects were 
moderately but not highly correlated with the initial measures. This type of 
situation may prevail in many educational and psychological experiments. 
Very frequently the criterion measure employed is one in which the parent 
population shows a large variance and in which the superimposed variances 
of the treatment effects are small in relation to this variance. In this case, 
even in the unlikely event that the variance of the treatment effects is several 
times larger for some treatments than for others, the variance of the final 
criterion measures may nevertheless be substantially the same for all treat- 
ments. It is quite common to find, in psychological experiments, that the 
treatments do not differ sufficiently to cause the observed means to differ 
by more than a relatively small percent of the general mean. In such a situa- 
tion, it is hard to believe that the treatments would cause the within-groups 
variances for one treatment to differ greatly — so much, say, as to be twice 
as large for one treatment as for another. 
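
The arithmetic of the preceding example follows directly from the formula for the variance of a sum. In the sketch below, the initial variance of 20 and the effect variances of 2 and 5 are those of the example; the function name and the correlation of 0.3 used for the "moderately correlated" case are assumptions introduced only for illustration.

```python
import math


def final_variance(initial_var, effect_var, r=0.0):
    """Variance of (initial measure + treatment effect), with correlation r."""
    return initial_var + effect_var + 2 * r * math.sqrt(initial_var * effect_var)


# Effects uncorrelated with the initial measures, as in the example in the text.
uncorrelated = [final_variance(20, 2), final_variance(20, 5)]

# Effects moderately correlated with the initial measures (r = 0.3 is assumed).
moderate = [final_variance(20, 2, r=0.3), final_variance(20, 5, r=0.3)]

print("uncorrelated:", uncorrelated)
print("r = 0.3:     ", moderate)
print("ratio of final variances, r = 0.3:", moderate[1] / moderate[0])
```

Although the variance of the treatment effects is two and one-half times as large for one treatment as for the other, the final variances remain of much the same magnitude in both cases.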

The type of experimental situation in which marked heterogeneity of vari- 
ance is particularly likely to occur is that in which the variance of the initial 
measures of the criterion variable, if available, would be found to be small in 
relation to the final variance for any treatment, and/or in which the final 
variances of the treatment groups are substantially correlated with their 
means. This frequently happens in trend studies, that is, in experiments 
designed to measure the effects upon a criterion variable of increasing amounts 
of a single experimental variable. Suppose, for example, that in a certain 
learning experiment the criterion measure is the improvement or gain made in 
a certain variable under the experimental conditions, the “treatments” representing 
different durations of the same experimental condition. By definition 
of the criterion measure, the population variance is 0 at the beginning 
of the experiment, and the variances of the treatment effects will be closely 
related to the means of these effects for the various treatments. In this 
situation, the relation of the variances and means of the criterion measures 
for the treatment groups may be as represented in the figure on page 78. 
In such an experiment, the differences among the variances would be just as 
marked as the differences among the treatment means, whatever the latter 
differences might be. In a case of this kind, the over-all F-test of the null 
hypothesis might be seriously invalidated by the failure to satisfy the assump- 
tion of homogeneity of variance. 

The safest generalization that we can make is that the assumption of 
homogeneity of variance is practically never strictly satisfied in educational 
and psychological experiments, but that in most instances the heterogeneity 
is not marked. Fortunately, the form of the sampling distribution of the mean 
square ratios is not very markedly affected by moderate degrees of 
heterogeneity of variance, and hence, the F-test may still be satisfactorily 
used in many experimental situations. 

[Figure: relation of the variances and means of the criterion measures for the 
treatment groups. X = duration of experimental condition; Y = criterion measure.] 

A number of empirical studies¹ have been made of the effect upon the 
F-distribution of failure to satisfy the underlying assumptions. By far the 
most comprehensive and significant of these studies is that which was 
conducted by Dee W. Norton² at the State University of Iowa. A brief summary 
of the Norton study is presented in the following section. 


The Norton Study of the Effects of Non-normality and Heterogeneity of Variance 


To investigate the effects of non-normality and of heterogeneity of variance 
upon the F-distribution, Norton constructed “card populations” of 10,000 
cases each, from which samples could be conveniently drawn by means of 
electric tabulating equipment (International Business Machines). The first 
phase of this study was concerned with the situation in which the distribution 
of criterion measures is identical for all treatment populations, but in which 
each differs from the normal population. Six different forms of distributions, 
selected as representatives of the range of forms of distributions most 
frequently met in educational and psychological research, were investigated. 
Figure 4 presents histograms representing the distributions of criterion 
measures for these populations. Population I, except for a finite range and 
lack of complete continuity, is essentially a normal distribution, and was 
included as a check upon the sampling procedures employed. These distributions 
have been plotted with approximately the same variance and the same 
area, so that they may be readily compared. 

1 M. S. Bartlett, “The Effect of Non-Normality on the t-distribution,” Proceedings 
of the Cambridge Philosophic Society, vol. 31 (1935), pp. 223-231; W. G. Cochran, op. 
cit., pp. 28-32; William G. Cochran, “Some Consequences When the Assumptions for 
the Analysis of Variance Are Not Satisfied,” Biometrics, vol. 3 (1947), pp. 22-38; 
R. A. Fisher, “On the Mathematical Foundations of Theoretical Statistics,” Philosophical 
Transactions of the Royal Society of London, vol. 22 (1922), pp. 309-368; R. H. 
Goddard and E. F. Lindquist, “An Empirical Study of the Effect of Heterogeneous 
Within-Groups Variance Upon Certain F-tests of Significance in Analysis of Variance,” 
Psychometrika, vol. 5 (1940), pp. 263-274; H. L. Rietz, “Topics in Sampling Theory,” 
Bulletin of the American Statistical Society, vol. 43 (1937), pp. 209-230. 

2 This study was first reported in an unpublished Ph.D. dissertation, “An Empirical 
Investigation of Some Effects of Non-normality and Heterogeneity on the F-distribution,” 
Ph.D. Thesis in Education, State University of Iowa, 1952. At the time of the 
completion of the manuscript for this book, Norton had begun an extension of his original 
studies and was planning to report the complete extended study in monograph form. 

From each of these populations independently, Norton selected 3,000 sets 
of k random samples of n cases each (k and n taking different values for differ- 
ent F-distributions). Each set thus corresponded to a hypothetical simple- 
randomized experiment with k treatments and n cases in each treatment 
group. For each set (or experiment) the ratio of the mean squares for 
“between-treatments” and “within-treatments” was computed, and a dis- 
tribution of these ratios was tabulated for the 3,000 experiments. An em- 
pirical distribution of 3,000 F’s was thus obtained for either one or two 
combinations of k and n for each of the six populations. 

The discrepancies, in the critical upper-tail region, between the empirical 
distributions thus obtained and the normal-theory F-distribution are described 
by the data in Table 4. The entry in a given row and column of this table 
represents the percent of mean square ratios in the empirical distribution 
(for sets drawn from the population identified at the left in the same row) 
which exceeds the percent point in the theoretical F-distribution identified 
at the top of the same column. For example, the entry 12.93 in the second 
row and fourth column of the body of the table indicates that in the em- 
pirical F-distribution for sets of 3 samples of 3 cases each (df = 2,6) drawn 
from the leptokurtic distribution, 12.93 percent of the obtained F’s exceeded 
the value 3.46, which is the 10% point in the F-distribution for the same 
degrees of freedom. At this point, then, the discrepancy is 12.93 − 10.00 =
2.93%.

The data in the first row of Table 4 provide a check on the sampling procedures
employed in this study. The method of sampling¹ used may be
described as one of continuous sampling without replacement from a finite,
but very large, population (N = 10,000). As will be noted in the first row
of Table 4, the empirical sampling distribution of F for samples drawn from
Population I contained a larger proportion of large F's than the theoretical,
but the discrepancies are not significant at the 10% level. They are, however,
large enough to suggest that this kind of sampling may tend to produce
slightly more large F's than would be found in simple random sampling from
an infinite normal population. There is no apparent logical basis for this
suggestion, but if the suggestion is true, the discrepancies reported in the
remainder of Table 4 are due in part to the method of sampling, rather than
to lack of normality or heterogeneity of variance alone, and the effects of
the latter factors are somewhat smaller than those reported.


¹ The 10,000 cards (containing the criterion measures) were arranged in random
order, and were then tabulated by fives to provide means and sums of squares for
2,000 samples of 5 cases each, these data being punched in a summary card for each
sample of 5. The 10,000 cards were then arranged in a new random order, and again
tabulated by fives to produce another 2,000 summary cards. The 4,000 summary
cards were then arranged in random order and tabulated by fours to provide the necessary
data for computing the F's for 1,000 sets of 4 samples of 5 cases each. The 4,000
summary cards were then arranged in a new random order and again tabulated by
fours to provide another 1,000 F's, and then finally arranged in a third and independent
random order to provide still another 1,000 F's.


FIGURE 4. Histograms of populations for which empirical
F-distributions were obtained in Phase 1 of the Norton Study
(panels: I. Normal; II. Leptokurtic; III. Rectangular; IV. Moderately Skewed;
V. Markedly Skewed; VI. J-Shaped)

It is evident from Table 4 that the F-distribution is amazingly insensitive 
to the form of the distribution of criterion measures in the parent population, 
granting that the same form is common to all treatment populations. Dis- 
crepancies significant at the 5% level are found only for the leptokurtic and 
rectangular distributions, and even then the absolute discrepancies are 
quite small. Apparently, the F-distribution is practically unaffected by lack 
of symmetry, per se, in the distributions of criterion measures, but is slightly 
affected if the distribution of criterion measures is roughly symmetrical 
but either very flat or very peaked. In the latter cases, the probabilities 
read from the normal-theory F-table are too small to represent the true 
risk of a Type I error, and due allowances should be made for this in the 
interpretation of results. In such cases, judging by the results reported 
in Table 4, when the “apparent” risk (as read from the F-table) of a Type I 
error is 5%, the true risk may be as large as 8%, and when the apparent level 
of significance of an F-test is the 1% level, the actual level of significance 
may be the 2% level (approximately). 

In a second phase of his study Norton investigated the effect of hetero- 
geneity of variance alone upon the F-distribution. For this purpose he con- 
structed three card populations, all of which were like Population I (normal), 
with the same mean, but with markedly different variances. Specifically, the 
population variances were approximately 25, 100, and 225, or the standard 
deviations 5, 10, and 15, respectively. 

The instances would be rare, of course, in which the effect of the treat- 
ments in an experiment would be to bring about large differences among the 
variances without also affecting the means. Generally, both mean and vari- 
ance would be affected together, so that if the mean of one treatment popu- 
lation were higher than that of the others, the variance of this population 
would tend to be larger also. In many such experiments, the purpose of the 
experimenter is not to test the hypothesis that the treatment population 
means are identical, but rather that they lie along some specified line, either 
straight or curved. If this hypothesis is true, the form of the distribution 
of the F employed in testing the hypothesis is independent of the form of the 
line, and the distribution would be the same if the means lie along a straight 
horizontal line (the usual null-hypothesis), or along any other line (see Chapter
15, pages 343-347). In the second phase of his study, then, Norton
really investigated the form of the F-distribution in the general case in which
a true hypothesis concerning the population means is being tested under
the condition that the population variances differ as described.


TABLE 4
Phase 1 of the Norton Study
Percents of Mean Square Ratios in Empirical Distributions
Exceeding Given Percent Points in the Normal-Theory F-Distribution
[Table entries are not legible in this copy.]

As in the first phase of his study, Norton selected 3,000 sets of 3 samples 
of 3 cases each, but this time each set consisted of 3 samples drawn one 
from each of the three different populations (differing variances). The 
F-ratio was again computed for each set (or hypothetical experiment), and
the distribution of F-ratios obtained for all sets. This procedure was repeated 
with 3,000 sets of 3 samples of 10 cases each, yielding a second empirical 
F-distribution with 2 and 27 degrees of freedom. The discrepancies between 
these two empirical F-distributions and the corresponding normal-theory 
F-distributions may be inferred from the data reported in the first two lines 
of Table 5, in the same manner as in Table 4. 

It is apparent from these results that marked heterogeneity of variance 
has a small but real effect on the form of the F-distribution. If one used
the probabilities read from the normal-theory F-table in interpreting the 
results of an experiment with this degree of heterogeneity, he might think he 
was making a test at the 5% level when actually he was making it at the 
7% level, or might think he was testing at the 1% level, when actually he 
was doing so at the 2+% level of significance, etc. Accordingly, where 
marked (but not extreme) heterogeneity is expected, it is desirable to allow 
for the discrepancy by setting a slightly higher “apparent” level of signifi- 
cance for this test than one would otherwise employ (the “apparent” level 
being that indicated by the F-table). For example, if one wished the risk 
of a Type I error to be less than 5%, he might require that the obtained F 
exceed the 2.5% point in the normal-theory F-distribution. The “apparent” 
level of significance would then be the 2.5%, but the actual level would be 
the 5% level. 
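The second-phase situation can be imitated in the same spirit. The sketch below rests on assumptions not in the original study: the three populations are taken as exactly normal with a common mean and standard deviations 5, 10, and 15, and are sampled with `random.gauss` rather than from card populations. The helper names are illustrative. It estimates the actual percent of F's exceeding the 5% point (5.14) and the 2.5% point (7.26) of F for 2 and 6 degrees of freedom.

```python
import random


def f_ratio(groups):
    """Between-treatments mean square divided by within-treatments mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(map(sum, groups)) / n_total
    means = [sum(g) / len(g) for g in groups]
    ms_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)) / (k - 1)
    ms_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (n_total - k)
    return ms_between / ms_within


def actual_levels(sigmas, n, critical_values, n_sets=3000, seed=1):
    """Percent of F's exceeding each critical value when group j has SD sigmas[j]."""
    rng = random.Random(seed)
    f_values = [
        f_ratio([[rng.gauss(50.0, s) for _ in range(n)] for s in sigmas])
        for _ in range(n_sets)
    ]
    return {c: 100.0 * sum(f > c for f in f_values) / n_sets for c in critical_values}


# Apparent levels 5% and 2.5% correspond to F = 5.14 and F = 7.26 for df = 2, 6.
levels = actual_levels([5.0, 10.0, 15.0], n=3, critical_values=[5.14, 7.26])
for critical, percent in levels.items():
    print(f"F > {critical}: {percent:.2f}% of sets")
```

Comparing the two printed percents with 5 and 2.5 shows how far the apparent level departs from the actual level under these assumed populations.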

In a third phase of his study, Norton investigated the effect of heterogeneity
in form of distribution (accompanied by only a slight degree of heterogeneity
of variance). In each “experiment” of this phase, one sample was drawn
from each of three populations. The first population was the same as Population
V in Figure 4, with a σ of 5.0. The second population was approximately
normal, but with a limited range (5 σ's) and a σ of 7.4. The third population
was exactly like Population V, except that the skew was in the opposite direction.
Experimental situations of this kind are quite frequently met in educational
and psychological research, when, due to the limited range of the
criterion test (that is, to the effect of a low test “ceiling” or a high “floor”),
the distribution of test scores is skewed in one direction for the “low” treatment
group, and in the opposite direction for the “high” treatment group.

As previously noted, variations in form and variance of the type just 
described would usually also be accompanied by variations in the population 
means. That is, such variations in form and variance would most often be 
found in “trend” studies where the hypothesis to be tested is that the
treatment-population means lie along some specified line, other than a straight
horizontal line. So long as the hypothesis being tested is true, however,
the form of the distribution of the F used to test the hypothesis is independent
of the nature of the line. Accordingly, in the third phase of his study, Norton
investigated the general case in which any true hypothesis concerning the
population means is being tested, but in which the forms and variances of
the populations differ in the manner described.


TABLE 5
Phases 2, 3, and 4 of the Norton Study
Percents of Mean Square Ratios in Empirical Distributions
Exceeding Given Percent Points in the Normal-Theory F-Distribution
[Table entries are not legible in this copy.]

In this third phase of his study, Norton selected 3,333 sets of 3 samples 
of 3 cases each (df = 2,6) and also 3,333 sets of 3 samples of 6 cases each
(df = 2,15). The discrepancies between each of the empirical F-distributions 
and the corresponding normal-theory F-distributions may be inferred from 
the data in lines 3 and 4 of Table 5. As in the case of heterogeneous variance 
only, the discrepancies in the extreme upper-tail region are significant, but 
small, and due allowances should be made for these discrepancies in inter- 
preting the results of an F-test. Again it is remarkable how slightly 
this degree of heterogeneity of form and variance affects the F-distri- 
bution. 

In the fourth phase of the study, Norton investigated the type of situation 
graphically portrayed on page 78 — a situation frequently met in learning 
experiments and trend studies in psychological research. In this phase of 
the study, each “experiment” involved 4 samples, drawn one from each of 
four different populations. The first population was the same as Population 
VI in Figure 4, with a σ of 2.2. The second was the same as Population V
in Figure 4, with a σ of 6.2. The third population was like Population IV,
except that the σ was 10.0, rather than 4.9, and the fourth population was
like Population I, except that the σ was 14.9, rather than 5.0. Thus the
variance of one of the populations was almost 45 times that of another, so 
that the heterogeneity both of form and of variance was extreme. 

As in previous instances, Norton investigated the specific case in which 
the population means are identical, but as before, this is equivalent to inves- 
tigating the general case in which any true hypothesis concerning the popu- 
lation means is being tested, the forms and variances of the populations being 
as described. 

In this final phase of the study, Norton drew 3,333 sets of 4 samples of 
3 cases each, and 3,000 sets of 4 samples of 10 cases each. The discrepancies 
between the two empirical F-distributions thus obtained and the correspond- 
ing normal-theory F-distributions may be determined from the data in the 
last two rows of Table 5. 

Even with the violent departure from the theoretical requirements repre- 
sented in this fourth phase of the study, the F-distributions still represent 
a fairly good fit to the normal-theory distribution. The discrepancy between 
the distributions is highly significant at nearly all points, but the absolute 
discrepancy is still not large enough to render the ordinary F-table valueless 
in such situations. True, the actual risk of a Type I error may be larger than
that indicated by the ordinary F-table (about twice as large at the 1% point),
but if one knows that this is the case and makes due allowance for it, one
may still use the normal-theory table to good advantage.

The results of the Norton study should be extremely gratifying to anyone 
who has used or who contemplates using the F-test of analysis of variance in 
experimental situations in which there is serious doubt about the underlying 
assumptions of normality and homogeneity of variance. Apparently, in the 
great majority of situations, one need be concerned hardly at all about lack 
of symmetry in the distribution of criterion measures, so long as this distribu- 
tion is homogeneous in both form and variance for the various treatment popu- 
lations, and so long as it is neither markedly peaked nor markedly flat. Most 
non-normal distributions met in practice are probably non-normal primarily 
because of lack of symmetry rather than because of lack of the “normal” 
degree of peakedness. In general, the F-distribution seems so insensitive to 
the form of the distribution of criterion measure that it hardly seems worth- 
while to apply any statistical test to the data to detect non-normality, even 
though such tests are available. Unless the departure from normality is so 
extreme that it can be easily detected by mere inspection of the data, the 
departure from normality will probably have no appreciable effect on the 
validity of the F-test, and the probabilities read from the F-table may be 
used as close approximations to the true probabilities. 

The findings of the Norton study are not quite so encouraging with reference
to situations in which the treatment populations are heterogeneous, either
in form, or in variance, or in both. However, the heterogeneity must be 
quite extreme to be of any serious consequence. While statistical tests 
of heterogeneity of variance are available (one is presented in the following 
section), there will be relatively few situations in which any such test is 
required. In general, unless the heterogeneity of either form or variance is 
so extreme as to be readily apparent upon inspection of the data, the effect 
upon the F-distribution will probably be negligible. In general, when the 
heterogeneity in form or variance is “marked” but not “extreme,” allowance 
may be made for this fact by setting a higher “apparent” level of significance 
for the tests of treatment effects than would otherwise be employed. In cases 
of very marked heterogeneity, for example, if one wishes the risk of a Type I 
error not to exceed 5%, he might require the effect to be “significant” at the 
2.5% level, or if he wants the risk of a Type I error not to exceed 1%, 
he might set the “apparent” level of significance of the test at 0.1%. 

The preceding is not meant to imply, of course, that allowance may always 
be made for heterogeneity of form or variance by “corrections” of the type 
suggested. On the contrary, there undoubtedly are some situations in psycho- 
logical research in which the heterogeneity in either form or variance, or both, 
may be considerably more extreme than in any of the hypothetical situations 
investigated by Norton. In these situations it is not known what “correc- 
tions” should be applied to the ordinary F-test, and special procedures to 
be considered in a later section (the use of transformations) must be employed 
in analyzing and interpreting the results. 




The Test for Homogeneity of Variance: Several tests¹ of homogeneity of
variance have been suggested, of which the most useful and convenient is
perhaps that devised by M. S. Bartlett. Before considering this test, we may
note that an estimate of the population variance ($\sigma^2$) can be derived from
each of the $a$ treatment groups by

$$\text{est'd } \sigma^2 = \sum^{n_j} (X - M_j)^2 / (n_j - 1).$$

We have noted also that a better estimate of $\sigma^2$ is provided by the mean
square for within-groups ($ms_w$), which is essentially an “average” of the
estimates obtained from the $a$ groups.
Bartlett's test is based upon the difference between the logarithm of this
“average” estimate and the sum of the logarithms of the individual estimates.
More precisely, Bartlett has shown that the expression

$$\frac{2.3026}{C}\left[(N - a)\log ms_w - \sum_{j=1}^{a}(n_j - 1)\log\left(\frac{\sum^{n_j}(X - M_j)^2}{n_j - 1}\right)\right]$$

is distributed approximately as $\chi^2$ with $a - 1$ df, the notation being
precisely that we have used earlier. The $C$ is a constant

$$C = 1 + \frac{1}{3(a - 1)}\left[\sum_{j=1}^{a}\frac{1}{n_j - 1} - \frac{1}{\sum_{j=1}^{a}(n_j - 1)}\right],$$

but need not be computed if the $\chi^2$ for the expression before division by $C$
is not significant, since $C$ is always larger than 1.00. The value of 2.3026 is
the ratio between a natural and a common logarithm.

In the example on page 57, the application of this test is as follows:


                                                    Treatments
                                           A₁        A₂        A₃        A₄
n_j                                         5         4         6         4
Σ(X − M_j)²                             13.20     14.75      6.00      2.00
Σ(X − M_j)²/(n_j − 1)                    3.30      4.92      1.20      0.67
log[Σ(X − M_j)²/(n_j − 1)]            0.51851   0.69197   0.07918   9.82607−10
(n_j − 1) log[Σ(X − M_j)²/(n_j − 1)]  2.07404   2.07591   0.39590   9.47821−10

$$\sum_{j=1}^{a}(n_j - 1)\log\left[\sum^{n_j}(X - M_j)^2/(n_j - 1)\right] = 4.02406$$

$$(N - a)\log ms_w = 15 \times .38021 = 5.70315$$

$$\chi^2 = 2.3026\,(5.70315 - 4.02406) = 3.87$$

20% value of χ² for 3 df = 4.642
Hence, 3.87 is not significant.


¹ A modification of the Bartlett test which is slightly more accurate when the n_j's are
very small has been provided by H. O. Hartley, “Testing the Homogeneity of a Set of
Variances,” Biometrika, vol. 31 (1940), pp. 249-255.

As previously noted, the usefulness of this test is quite limited. In view 
of the results of the Norton study, it is apparent that the test is needed 
at all only when the treatment groups are quite small. When the treatment 
groups are large, one has only to tally the distributions of criterion measures 
for the various treatment groups to tell by inspection if the heterogeneity is 
sufficiently marked to be likely to have any appreciable effect on the F- 
distribution. When the treatment groups are small, the Bartlett test may 
be needed to determine whether or not any heterogeneity of variance char- 
acterizes the treatment populations, but the Bartlett test will not indicate 
how marked the heterogeneity really is, nor how seriously it affects the 
validity of the usual F-test. 

The Use of Transformations: The assumptions of homogeneity of variance 
and of normality of distribution are often intimately related, so that a serious 
failure to satisfy one condition is accompanied by a failure to satisfy the 
other. For example, in any instance in which there is a definite relation- 
ship between the means and variances of the treatment groups, there is likely 
also to be a tendency for some of the distributions to be sharply skewed to 
the right. Sometimes it is possible to overcome such difficulties by transform- 
ing the criterion measures into derived measures whose variances are more 
homogeneous and whose distributions are also more nearly normal. For ex- 
ample, the standard deviations of the treatment groups may tend to be linearly 
related to their means, as could be true in the type of situation considered on 
page 78. In this case, if the logarithm of X or of (1 + X) is substituted for X 
as a criterion measure, it may be found that the variances of the transformed 
measures are much more homogeneous than those of the original measures and 
that the transformed distributions are much more nearly normal as well. In 
that case, the F-test of significance applied to the transformed measures will, 
of course, be more valid than that applied to the original measures. Other 
functions of X that have been frequently employed in transformations are its 
square and square root. Transformations of the type Y = f(X) are appropriate 
only when there is a relationship between the means and variances of the 
treatment groups; or, if the means of the treatment groups do not differ 
markedly, only when the distribution of the criterion measures is of the same 
form for all treatments. If the treatments cause the variances of the criterion 
measures to differ but do not create differences among the means, or if they
cause the variances to differ independently of the differences among the means,
then no valid transformation is possible. Again, if the distributions are 
homogeneous in variance but differ in form of distribution, no valid trans- 
formation is possible. Sometimes the variances are homogeneous and all dis- 
tributions are of the same form, but this form is one that cannot be normalized 
by a transformation of the type Y = f(X). In this case, it may be possible
to normalize the distribution by means of an “area transformation” based 
on a large independent sample from one of the treatment populations, or 
from a population known to be closely similar to this population. If the 
treatments have identical effects, the distribution will then, of course, be 
normalized for all treatment populations. This is done by transforming each 
value of the original measure into a derived measure whose percentile rank in 
a normal distribution of selected mean and variance is the same as the per- 
centile rank of the original measure in the independent sample. Students of 
educational and psychological measurement will recognize the McCall “T-score”
as a transformation of this type. Scores on educational and psychological
tests are frequently normalized in this fashion, and when thus normalized
may more appropriately be used in analysis of variance tests than could
the original measures. There are many important types of psychological data
which require transformation before the test of analysis of variance may be
validly applied, and the subject of transformations is therefore of considerable
importance to many psychological research workers. A thoroughgoing treatment
of this subject is beyond the scope of this text, but helpful discussions of
this problem are readily available elsewhere.¹
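Two of the transformations discussed above can be sketched briefly. The data are hypothetical and chosen only for illustration: three groups whose standard deviations are exactly proportional to their means, and a small skewed reference sample. The function names are likewise illustrative, and the T-score constants (mean 50, standard deviation 10) are the conventional ones, assumed here rather than taken from the text.

```python
import math
from statistics import NormalDist, variance


def variance_ratio(groups):
    """Largest group variance divided by the smallest."""
    variances = [variance(g) for g in groups]
    return max(variances) / min(variances)


def t_score(x, reference, mean=50.0, sd=10.0):
    """Area transformation: map x to the normal deviate with the same percentile rank.

    The percentile rank counts half of the reference cases tied with x, so x
    must not fall entirely below or above the reference sample.
    """
    below = sum(r < x for r in reference)
    tied = sum(r == x for r in reference)
    rank = (below + 0.5 * tied) / len(reference)
    return mean + sd * NormalDist().inv_cdf(rank)


# Hypothetical groups whose standard deviations are proportional to their means.
base = [0.8, 0.9, 1.0, 1.1, 1.2]
groups = [[scale * x for x in base] for scale in (10.0, 20.0, 40.0)]
logged = [[math.log10(x) for x in g] for g in groups]
print(f"variance ratio, raw scores: {variance_ratio(groups):.2f}")
print(f"variance ratio, log scores: {variance_ratio(logged):.2f}")

# Hypothetical skewed reference sample for the area transformation.
reference = [1, 2, 3, 4, 100]
print([round(t_score(x, reference), 1) for x in reference])
```

In the first part the logarithm turns a multiplicative difference between groups into an additive one, so the group variances become equal. In the second part the extreme score of 100 is pulled in to the same distance from the center as the lowest score.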

Difficulties due to the non-normality of the criterion measures may some- 
times be avoided by employing a different criterion variable than that origi- 
nally contemplated. Very frequently, in educational and psychological 
research, the choice of a criterion is quite arbitrary, and several alternate 
criteria may be available, all of which are about equally appropriate to the 
general purpose of the experiment, but some of which are more nearly normally 
distributed than others or some of which can more readily be transformed to 
normally distributed measures. Research workers in psychology have often 
caused themselves needless trouble by failing to give consideration to this 
possibility in the early stages of planning their experiments. 

¹ Students particularly interested in this subject should see C. G. Mueller, “Numerical
Transformations in the Analysis of Experimental Data,” Psychological Bulletin,
vol. 46, no. 3 (May, 1949), pp. 198-223. This is perhaps the most comprehensive of
available discussions of the use of transformations in psychological research and contains
an excellent bibliography of forty-nine references. A useful discussion of transformations
is also found in Oscar Kempthorne, The Design and Analysis of Experiments
(New York: John Wiley and Sons, Inc., 1952), pp. 153-158.


There are some experimental situations in which the data show extreme
deviations from normality of distribution, and which are not amenable to
transformation. This is particularly likely to be true when the criterion
measures represent frequencies of occurrence of a certain type of behavior,
such as frequency of aggressive behavior, frequency of stuttering, number of 
trials needed to run a maze, number of trials required to learn to perform a 
task, etc. With such data, the distribution is sometimes J-shaped, with a 
large number of undistributed cases at the zero point on the scale. Such non- 
continuous or discrete distributions are frequently not amenable to successful 
transformation, and the tests of analysis of variance may therefore not be 
applicable to the data. Fortunately, considerable interest has been shown 
recently in the development of non-parametric tests, that is, tests that make 
no assumptions about population parameters, and some effective work has 
been done in the development of such tests.¹ Non-parametric tests will undoubtedly
soon occupy a very important place in psychological research.
However, all of these tests are less powerful than those assuming normality
and homogeneity of variance. It is highly probable, therefore, that the more
powerful tests of analysis of variance will continue indefinitely to be employed 
in the majority of experiments performed in education and psychology, or in 
all situations in which the necessary assumptions seem satisfied. 


¹ A non-parametric test based on runs is described by A. Wald and J. Wolfowitz, “On
a Test of Whether Two Samples Are from the Same Population,” Annals of Mathematical
Statistics, vol. 11 (1940), pp. 147-162. The “U-test” is described by H. B. Mann
and D. R. Whitney, “On a Test of Whether One of Two Random Variables is Stochastically
Larger than the Other,” Annals of Mathematical Statistics, vol. 18 (1947),
pp. 50-60. Both of these tests are concerned with the hypothesis that the two samples
in question are drawn from populations with a common cumulative frequency distribution
and both are influenced by differences in form or differences in central tendency
between the two distributions. Excellent general discussions concerning non-parametric
inference are those by S. S. Wilks, “Order Statistics,” Bulletin of the American
Mathematical Society, vol. 54 (1948), pp. 6-50; and Henry Scheffé, “Statistical Inference
in the Non-parametric Case,” Annals of Mathematical Statistics, vol. 14 (1943), pp. 305-332.
Both of these papers are quite comprehensive and provide extensive bibliographies.
See also Alexander M. Mood, Introduction to the Theory of Statistics (New York:
McGraw-Hill Book Company, Inc., 1950), Chapter 16, “Distribution Free Methods,” pp.
385 ff.


Testing the Significance of the Difference in Means for Individual Pairs of Treatments


In a simple-randomized experiment, our ultimate interest is usually in
the differences between individual treatments. For instance, we may wish to
know if A₁ differs from A₃, or if A₁ differs from A₄, or if A₂ differs from any
and all of the other treatments, etc. The F-test of the over-all null hypothesis
is regarded as essentially a way of applying a test of significance simultaneously
to the observed differences for all possible pairs of treatments. If
the F of this over-all test proves to be non-significant, we know at once
that all observed differences among individual treatments are simultaneously
attributable to chance alone. Knowing this, there would be little point in
applying the t-test successively to the differences for individual pairs of
treatments. If we did so, we would be guilty of the fallacy discussed on
pages 48-49. Accordingly, if the over-all F proves non-significant, we usually
go no further with the analysis.

However, if ms_A/ms_w does prove significant, it does not necessarily follow
that the population mean for each treatment differs from that for every other. 
In an experiment involving four treatments, for example, it is quite possible 
that three of the treatments have identical average effects on the population, 
that only one differs from the others, and that it is the effect of this one 
treatment which accounts for the significant F. Accordingly, before con- 
cluding that any one treatment differs from any other, we must apply a 
test of significance to the difference in the observed means for these two 
treatments alone. The preliminary over-all F-test is then regarded simply 
as a way of determining whether or not it is necessary and legitimate to 
apply the t-test to individual pairs of means. If the over-all F proves sig- 
nificant, we nearly always go on to these more specific tests. 

We know that the error variance of the difference between the means of
two random samples (1 and 2) drawn from populations with the same variance
($\sigma^2$) is given by

$$\sigma^2_{M_1 - M_2} = \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)$$

in which $n_1$ and $n_2$ are the numbers of cases in the samples involved. We know
also that in a simple-randomized experiment, on the assumption of homogeneous
variance, the mean square for within-groups ($ms_w$) is an unbiased
estimate of the common variance ($\sigma^2$) of the treatment populations (page 60).
Accordingly, an unbiased estimate of the error variance of the difference
between two observed treatment means is given by

$$\text{est'd } \sigma^2_{M_1 - M_2} = ms_w\left(\frac{1}{n_1} + \frac{1}{n_2}\right)$$

and on the assumption of normality, the difference may be tested for significance
by means of

$$t = \frac{M_1 - M_2}{\sqrt{ms_w\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \qquad (19)$$


for which the degrees of freedom are those for $ms_w$, that is, $N - a$. For the
illustrative exercise on page 57, this t for $M_1 - M_2$ is

$$t = \frac{2.55}{\sqrt{2.40\left(\dfrac{1}{5} + \dfrac{1}{4}\right)}} = 2.45,$$


with 15 degrees of freedom. Suppose we had decided to make the test at
the 1% level of significance. For 15 degrees of freedom, the smallest t which is
significant at the 1% level is 2.947. Hence, we would regard this particular
difference of 2.55 as non-significant.
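The arithmetic of formula (19) for this pair can be sketched as follows. The difference 2.55, the group sizes 5 and 4, and the critical value 2.947 are stated in this chapter; taking ms_w as 2.40 (the rounded value whose logarithm, .38021, appears in the Bartlett example) is an assumption, and the function name `pair_t` is illustrative.

```python
import math


def pair_t(mean_difference, ms_within, n1, n2):
    """t for the difference between two treatment means, using the pooled ms_w."""
    standard_error = math.sqrt(ms_within * (1.0 / n1 + 1.0 / n2))
    return mean_difference / standard_error


t = pair_t(mean_difference=2.55, ms_within=2.40, n1=5, n2=4)
critical_t = 2.947  # 1% point of t for 15 degrees of freedom, as given in the text
print(f"t = {t:.2f}; significant at the 1% level: {abs(t) > critical_t}")
```

Note that the error term comes from all four treatment groups, which is why the test carries N − a = 15 degrees of freedom rather than the 7 that the two groups alone would supply.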

In situations of this kind, the practice has been very common of computing
the t for each of the possible differences among treatment means, consulting
the t-table to determine the relative frequency with which each value
of t would be exceeded (in absolute amount) in the long run if the hypothesis 
were true, and reporting this probability for each difference separately. This 
is equivalent to reporting the maximum level at which each difference is
significant. Thus, in effect, one difference might be reported as significant at 
the 19.13% level, another at the 0.18% level, another at the 5.22% level, etc. 
This means, essentially, that the experimenter selects the level of significance 
for each test after the t for that difference has been computed. This practice
is logically inconsistent. Strictly, one should decide what risk he is
willing to take of rejecting a true hypothesis (whether it be the over-all
null hypothesis or a more specific null hypothesis concerned with a particular
pair of treatments) before he examines the experimental results. That is,
he should decide in advance on the level of significance of all tests that he 
is to make, taking into consideration the consequences of both Type I and 
Type II errors. If the over-all F is then not significant at this level, he need 
go no further with the analysis. In this case, the over-all null hypothesis 
may be retained, which, of course, means that the null hypothesis concerning
any particular pair of treatments may be retained also. On the other
hand, if the over-all F is significant, he should report for each difference 
whether or not it exceeds the value required for significance at the selected level. 
That is, he should report each difference as simply either “significant” 
or “non-significant” in all instances with reference to the selected level 
only. 

The practice of reporting a different (maximum) level of significance or a 
separate probability for each individual difference is objectionable because 
it ignores Type II errors, or else implies that one is willing to accept different 
risks of a Type II error in testing different specific hypotheses (see pages 
66-70). It is undesirable also because it encourages the use of the t or of
the corresponding probability as a measure of the "importance" of the 
difference. This may result from a tendency to attach to “significance” 
as a statistical concept some of the connotations of “significance” as a gen- 
eral term. It should be remembered that the purpose of the t-test of signifi- 
cance is simply to decide categorically whether or not to reject the null 
hypothesis, and not to estimate the magnitude of, or to describe the importance 
of, the corresponding difference in population means. The problem of estima- 
tion should not be confused with the problem of testing hypotheses. The best 
estimate of the difference in population means for any two treatments is the 
difference in the corresponding means observed in the experiment, regardless 
of the maximum level at which the associated t is significant. Quite obviously,
the smaller of two differences can be significant at a higher level than the 
larger, if the smaller difference is based on larger numbers of cases. Even 


TESTING THE SIGNIFICANCE OF THE DIFFERENCE IN MEANS 93 


though the t's for both differences have the same degrees of freedom, the 
maximum levels at which they are significant are obviously not indicative 
of the relative magnitudes of the corresponding population differences. One 
should never attempt, then, to draw any inferences about the relative impor- 
tance or magnitude of a difference from the maximum level at which the 
difference is significant, nor does one need to compute the t for a difference in
order to draw any inferences concerning its magnitude or importance. The t
corresponding to an obtained mean or difference may be needed to describe its
reliability as an estimate of the corresponding population value, or to
establish a confidence interval for the population value, but the t is not needed
for the single-valued estimate itself. Regardless of the value of the cor- 
responding t, the obtained mean or difference is the best estimate of the 
corresponding population value. 

If all treatment groups are of the same size, it is not necessary to compute 
the t for any difference. In this case, one may compute the “critical
difference” corresponding to the selected level, and report as significant all
differences exceeding this “critical difference.” The formula for computing the
critical difference is obtained by solving for $d = M_1 - M_2$ in (19), which yields

$$d = t\sqrt{\frac{2\,ms_w}{n}} \tag{20}$$

t representing the value of t which is just significant at the selected level
for the given degrees of freedom.

To illustrate the recommended procedure, suppose that an experiment has 
been performed with five treatments, all treatment groups being of the same 
size (n = 10). Suppose it is decided to take a 1% risk of rejecting a true 
hypothesis; that is, suppose it is decided to make the tests of significance 
at the 1% level. For 4 and 45 degrees of freedom, the 1% value of F is 3.83. 
Suppose the mean square ratio is then computed and found to be 8.32.
Accordingly, the over-all null hypothesis is rejected. The next step is to
determine from the t-table what value of t would be significant at the 1% level
for the degrees of freedom for $ms_w$. (Note that it would not be consistent
to employ a different level of significance for the t-tests than was employed
for the over-all F-test.) Suppose that $ms_w$ = 28.80. The critical value of
t at the 1% level for 45 degrees of freedom may be read from the normal
probability integral table, and is 2.58. (Note that this is a two-tailed test.)
Accordingly, by means of (20) we find the critical difference is


$$d = 2.58\sqrt{\frac{2 \times 28.80}{10}} = 6.20.$$


Suppose that the individual means and differences have been computed and 
found to have the values indicated in the table which follows. (The entry in 
each cell in the right-hand table is the difference between the means for the 
treatments indicated in the margins of the table.) 




Means                   A2      A3      A4      A5
M1 = 47.2       A1     3.8    10.5*    6.9*    1.4
M2 = 43.4       A2             6.7*    3.1     5.2
M3 = 36.7       A3                     3.6    11.9*
M4 = 40.3       A4                             8.3*
M5 = 48.6                 Table of Differences


We may then indicate with an asterisk all differences that exceed 6.20, that 
is, all differences significant at the 1% level. The specific null hypothesis 
would then be rejected for each of the pairs of treatments thus identified. 
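
The recommended procedure lends itself to a short computational sketch. The one below is an illustration only: it uses the five means, the error mean square (28.80), the group size (10), and the critical t (2.58) given in the illustration, while the function and variable names are hypothetical. The unrounded product is about 6.19, which the text works with as 6.20; the same five differences are identified either way.

```python
import math
from itertools import combinations

def critical_difference(t_crit, ms_w, n):
    """Smallest difference between two means (equal group size n) that is
    significant at the level corresponding to t_crit; formula (20)."""
    return t_crit * math.sqrt(2.0 * ms_w / n)

# Values taken from the illustration in the text.
means = {"A1": 47.2, "A2": 43.4, "A3": 36.7, "A4": 40.3, "A5": 48.6}
d_crit = critical_difference(t_crit=2.58, ms_w=28.80, n=10)

# Every pair of treatments whose observed difference exceeds the
# critical difference is reported as significant at the selected level.
significant_pairs = [
    (a, b, round(abs(means[a] - means[b]), 1))
    for a, b in combinations(means, 2)
    if abs(means[a] - means[b]) > d_crit
]
```

Only one critical value is computed for the whole table of differences, which is the economy the procedure is meant to provide when all groups are of the same size.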

Some might argue that a better procedure would be to compute the t for
each difference, to report with each difference in the table of differences
the probability corresponding to the t, and then to identify as "significant"
all differences for which the risk of a Type I error is less than the selected 
risk, that is, all differences for which the reported probability is less than 
a pre-selected value. All specific null hypotheses would then be tested at 
the same level of significance, keeping constant the risk of accepting a “seri- 
ously false" null hypothesis (assuming that “seriously false" has the same 
meaning for all specific hypotheses). This procedure would have the ap- 
parent advantage of accomplishing all that is accomplished by the recom- 
mended procedure and, at the same time, of revealing that certain specific 
hypotheses may be rejected with “more confidence” than others. However,
it is impossible to show in operational terms that any real advantage is gained 
by the extra labor involved in computing the t and reporting the probability
for each difference individually. The rejection of the null hypothesis in 
each case is categorical — either the hypothesis is rejected or it is retained — 
there are no degrees of rejection. The subsequent operations — the ad- 
ministrative decisions made and the actions taken on the basis of the test 
of significance — will be exactly the same whether the hypothesis is re- 
jected at the 0.1% level or at the 10% level, so long as it is rejected at all. 
The recommended procedure is simpler and more economical and avoids any 
danger that the probabilities or maximum levels of significance will be used 
as measures of the relative importance of the differences. 

The only valid justification for the practice of reporting a separate proba- 
bility for each difference is that the reader of the research report may then 
more conveniently employ any level of significance he prefers in interpreting 
the results. In this writer's opinion, this argument, while valid, is not a
sufficient reason for the practice (but this is an opinion by no means shared 
by all statisticians). The group n's and the error mean square should of 
course be reported, so that the reader may make his own tests if he wishes, 
but the recommended procedure is less likely to result in misinterpretations 
by the less sophisticated readers of the report. 




If the number of cases varies from group to group, the recommended pro- 
cedure in the form illustrated cannot be followed. If there are several treat- 
ments, one may first compute the critical difference for the two treatment 
groups with the two smallest numbers of cases. All differences exceeding 
this value may then be immediately identified as significant, even though 
based on larger numbers of cases. The critical difference for the two largest 
groups may then be computed. All differences smaller than this may at once 
be identified as non-significant. Specific critical differences need then be 
computed only for the remaining comparisons.¹
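
The screening procedure for unequal groups can be sketched as follows. This is a hedged illustration, not a prescribed routine: the function name `screen_differences`, the means, the group sizes, the error mean square, and the critical t of 2.01 are all hypothetical. The critical difference for any pair is taken from (19) as t times the square root of $ms_w(1/n_1 + 1/n_2)$.

```python
import math
from itertools import combinations

def screen_differences(means, sizes, ms_w, t_crit):
    """Classify each pairwise difference using two bounding critical
    differences first, computing a pair-specific value only when needed."""
    def crit(n1, n2):
        return t_crit * math.sqrt(ms_w * (1.0 / n1 + 1.0 / n2))

    ordered = sorted(sizes.values())
    upper = crit(ordered[0], ordered[1])    # two smallest groups: largest critical difference
    lower = crit(ordered[-2], ordered[-1])  # two largest groups: smallest critical difference

    results = {}
    for a, b in combinations(means, 2):
        diff = abs(means[a] - means[b])
        if diff > upper:
            results[(a, b)] = True    # significant whatever the two group sizes
        elif diff < lower:
            results[(a, b)] = False   # non-significant whatever the two group sizes
        else:
            results[(a, b)] = diff > crit(sizes[a], sizes[b])
    return results

# Hypothetical data for four treatments with unequal group sizes.
means = {"A1": 20.0, "A2": 26.5, "A3": 21.0, "A4": 24.5}
sizes = {"A1": 8, "A2": 12, "A3": 15, "A4": 20}
flags = screen_differences(means, sizes, ms_w=16.0, t_crit=2.01)
```

In this made-up case only the A3 and A4 comparison falls between the two bounds and requires its own critical difference; the other five are settled by the bounds alone.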

If we then wish to rank the treatments on the basis of their estimated 
relative potency, we may do so on the basis of the observed treatment means. 
In the illustration just used, the best estimate we can make is that $A_5$ is the
most effective treatment, $A_1$ the next most effective, etc. We need not
refer to any t's to make these estimates, and the estimates are the best
available regardless of the sizes of the treatment groups or of the estimated
standard errors of the means. Knowledge of the standard errors would help us
judge the reliability of the ranking, but would not change the estimated 
ranks. 

We noted earlier in this chapter that in a simple-randomized experiment, 
it is not legitimate to select the largest of the observed differences among 
treatment means and to apply the ordinary t-test to this difference. If this 
were done, the risk of rejecting a true hypothesis would be very much larger 
than that indicated by the t-table. A similar difficulty still exists even though
a significant over-all F has already been obtained. Regardless of the out- 
come of the over-all F-test, it is still true that if one selects the largest of 
the observed differences for individual pairs of means and applies a t-test 
to this difference, the risk of rejecting a true null hypothesis specific to these 
two treatments is not that indicated by the probability read from the t-table.
However, if one sets the same level of significance for both the over-all F-test
and the t-test for the largest difference, and then finds that both the F and
the t are significant at this level, one may safely say that the risk of rejecting 
a true hypothesis is no larger than that indicated by the level of significance 
set. Exactly what is the risk under these circumstances of rejecting a true 
hypothesis concerning only these particular two treatments, we cannot 
say. The logical difficulties encountered in attempting a rigorous
interpretation of the joint outcome of these F- and t-tests are far too involved to
be considered here. However, statisticians are agreed that if the over-all
F proves significant, one may safely, for any practical purposes, interpret
the t-tests applied to individual differences as if they were independent t-tests.
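
The inflation of risk that results from applying the ordinary t-test to the largest observed difference can be made concrete by a small sampling experiment. The sketch below is an illustration under stated assumptions: five treatments with ten cases each, normally distributed measures, a true over-all null hypothesis, and a critical t of 2.014 (the 5% two-tailed value for 45 degrees of freedom from standard tables). The function names, the number of trials, and the seed are hypothetical choices.

```python
import math
import random
from itertools import combinations

def largest_t(groups):
    """Largest absolute pairwise t, using the within-groups mean square."""
    n = len(groups[0])
    means = [sum(g) / n for g in groups]
    ss_w = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_w = ss_w / (len(groups) * (n - 1))
    se = math.sqrt(2.0 * ms_w / n)
    return max(abs(a - b) for a, b in combinations(means, 2)) / se

def rejection_rate(a=5, n=10, t_crit=2.014, trials=2000, seed=1):
    """Proportion of experiments, with the null hypothesis true, in which
    the largest difference exceeds the ordinary critical t."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        groups = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(a)]
        if largest_t(groups) > t_crit:
            hits += 1
    return hits / trials

rate = rejection_rate()
```

Under these assumptions the observed proportion of rejections runs well above the nominal .05, which is the difficulty described in the paragraph above.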

It should be noted that, particularly when a large number of treatments is 
involved, the over-all F-test is not very sensitive to the effect of a single 


1 Students interested in pursuing this problem farther should consult John W. Tukey, 
“Comparing Individual Means in the Analysis of Variance," Biometrics, vol. 5, no. 2 
(June, 1949), pp. 99-114. 




treatment which differs from all the others — all the others being practically 
identical in effect. This is just another way of saying that the more nearly 
the over-all null hypothesis is true, the greater is the risk of a Type II error. 
If only one treatment really differs from the others, it could readily happen
that the t corresponding to the difference between this particular treatment
mean and the average of the remaining treatment means is “significant,”
even though F of the over-all test is non-significant. If this happens in an
actual experiment, however, we cannot be sure whether the significant t is
due to a real difference between this one treatment and the other treatments,
or is to be explained in the manner suggested on pages 48-49. About all
that could be done in a situation of this kind is to regard the observed results 
as a suggestion, rather than as evidence, that the given treatment differs 
from the others, and to design a new and independent experiment to de- 
termine whether or not this one treatment does really differ from the others. 


The Significance of the Difference Between Two Sample Means When the Population Variances Differ


As has been noted earlier, there are many important applications of the 
simple-randomized design in educational and psychological research, in which 
the variances of the treatment populations differ markedly. However, unless 
the heterogeneity of variance is extreme, one may still employ the usual over- 
all F-test of the treatments effect, making due allowance for the fact that 
the probabilities read from the F-table will tend to underestimate the true 
probabilities. However, the fact that the usual F-test is still valid does not 
mean that the usual t-test is valid also — that is, the t-test based on the mean 
square for within-treatments computed from the entire experiment. Sup- 
pose, for example, that in a certain experiment involving four treatments, 
two of the treatment populations have very nearly equal variances, but the 
other two treatment populations have markedly larger and differing variances. 
Quite obviously, in this case, the error variance of the difference in the ob- 
tained means for the first two treatments would be seriously overestimated 
by $2ms_w/n$ and a t-test based on this error estimate would be seriously biased.
When marked heterogeneity of variance is suspected, therefore, the t-test 
of the difference between the means of any two treatment groups should be 
based on the data for those two treatment groups alone, rather than on the 
mean square for within-treatments computed from all treatment groups. If 
the particular treatment populations involved presumably have nearly equal 
variances, the usual t-test of the difference between the means of two inde- 
pendent random samples may be employed. However, if the two treatment 
populations presumably have markedly differing variances, only an approxi- 
mate t-test is possible. 

Before presenting this approximate test, we may note that the standard 
error of the difference between the means of two independent random samples 


SIGNIFICANCE OF DIFFERENCE BETWEEN TWO SAMPLE MEANS 27 


is always given, regardless of the variances of the parent populations or the 
forms of their distributions, by 


$$\sigma_{M_1 - M_2} = \sqrt{\sigma^2_{M_1} + \sigma^2_{M_2}}$$


and an unbiased estimate of the standard error of the difference may always
be obtained by

$$\text{est'd } \sigma_{M_1 - M_2} = \sqrt{\frac{\sum d_1^2}{n_1(n_1 - 1)} + \frac{\sum d_2^2}{n_2(n_2 - 1)}} \tag{21}$$


When $n_1$ and $n_2$ are both large, the ratio

$$\frac{M_1 - M_2}{\sqrt{\dfrac{\sum d_1^2}{n_1(n_1 - 1)} + \dfrac{\sum d_2^2}{n_2(n_2 - 1)}}}$$


is very nearly normally distributed and the normal probability table may be 
used to interpret this ratio in testing the significance of the difference, even 
though the population variances differ. This is the “significance ratio” 
technique traditionally employed in large sample theory. If either $n_1$ or $n_2$
is small, this ratio is not normally distributed, nor is it distributed as t if the
population variances differ. However, if $n_1$ and $n_2$ are both small and
different, one may employ a test suggested by Behrens and discussed in Fisher
and Yates' Statistical Tables (Oliver and Boyd), 1948, page 3. Behrens
has shown that the difference between the means ($M_1 - M_2$) of two samples
of $N_1$ and $N_2$ cases respectively, drawn from populations with different
variances, is significant if

$$\frac{M_1 - M_2}{\sqrt{\text{est'd } \sigma^2_{M_1} + \text{est'd } \sigma^2_{M_2}}} \geq d,$$

where d is a value tabled (see page 46 in Fisher and Yates) for the 5% and
1% levels and dependent on the three values,

$$n_1 = (N_1 - 1),$$

$$n_2 = (N_2 - 1),$$

and

$$\theta = \arctan \frac{\text{est'd } \sigma_{M_1}}{\text{est'd } \sigma_{M_2}}.$$
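
The quantities needed to enter the Behrens table can be computed as in the sketch below. It is offered as an illustration only. It assumes that the test ratio is the difference in means divided by the square root of the sum of the two estimated error variances of the means, and that the angle is the one whose tangent is the ratio of the two estimated standard errors of the means, as the Fisher and Yates table is understood to define it. The function name and the numbers in the usage lines are hypothetical, and the tabled value d must still be read from the table itself.

```python
import math

def behrens_quantities(mean_diff, var_mean_1, var_mean_2, n_cases_1, n_cases_2):
    """Return the test ratio and the three values on which the tabled d
    depends: the two degrees of freedom and the angle theta (in degrees)."""
    ratio = mean_diff / math.sqrt(var_mean_1 + var_mean_2)
    # Assumed definition: tan(theta) = est'd sigma of M1 / est'd sigma of M2.
    theta = math.degrees(math.atan(math.sqrt(var_mean_1 / var_mean_2)))
    return ratio, n_cases_1 - 1, n_cases_2 - 1, theta

# Hypothetical values, chosen so the arithmetic is easy to follow.
ratio, df_1, df_2, theta = behrens_quantities(
    mean_diff=4.0, var_mean_1=1.0, var_mean_2=1.0, n_cases_1=7, n_cases_2=13
)
```

With equal estimated error variances the angle is 45 degrees; the ratio would then be compared with the d tabled for 6 and 12 degrees of freedom at that angle.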


A somewhat less exact but in most cases quite satisfactory test for practical
purposes has been suggested by Cochran and Cox.¹ To apply this test,
we let $df_1$ and $df_2$ be the numbers of the degrees of freedom corresponding to
est'd $\sigma^2_{M_1}$ and est'd $\sigma^2_{M_2}$, respectively. We then determine from the t-table the


1 W. G. Cochran and Gertrude M. Cox, Experimental Designs (New York: John
Wiley and Sons, Inc., 1950), pp. 92-93. 




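values $t_1$ and $t_2$ of t which are significant at the selected level for $df_1$ and $df_2$
respectively. We then compute

$$t' = \frac{(\text{est'd } \sigma^2_{M_1})\,t_1 + (\text{est'd } \sigma^2_{M_2})\,t_2}{\text{est'd } \sigma^2_{M_1} + \text{est'd } \sigma^2_{M_2}}$$

and use (21) to estimate $\sigma_{M_1 - M_2}$.
If, then,

$$\frac{M_1 - M_2}{\text{est'd } \sigma_{M_1 - M_2}} > t' \tag{22}$$

we may say that $M_1 - M_2$ is significant at the desired level.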
For example, in the computational exercise on page 57, est'd $\sigma^2_{M_1}$ = 1.23,
est'd $\sigma^2_{M_2}$ = .20, est'd $\sigma^2_{M_1 - M_2}$ = 1.43, and est'd $\sigma_{M_1 - M_2}$ = 1.196. Accordingly,

$$\frac{M_2 - M_1}{\text{est'd } \sigma_{M_1 - M_2}} = \frac{5.25}{1.196} = 4.39.$$

At the 5% level of significance for 3 degrees of freedom, $t_1$ = 3.182 and for
5 degrees of freedom $t_2$ = 2.571. Hence,

$$t' = \frac{1.23 \times 3.182 + .20 \times 2.571}{1.23 + .20} = 3.10.$$

Thus we find, since 4.39 > 3.10, that $M_2 - M_1$ is significant at the 5%
level.

When $n_1 = n_2 = n$, it follows that $t_1 = t_2$ and $t' = t_1 = t_2$. Thus, when the
two groups are the same size, the Cochran-Cox test reduces¹ to the ordinary
t-test for the difference of the means of two independent random samples, but
with the number of degrees of freedom equal to $n - 1$ instead of $n_1 + n_2 - 2 = 2(n - 1)$.
This shows clearly that the Cochran-Cox test is less sensitive than
that which assumes homogeneity of variance.
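
The Cochran-Cox computation just illustrated can be checked with a short sketch. The numbers are those quoted from the exercise on page 57 (estimated error variances of the means 1.23 and .20, tabled t's 3.182 and 2.571, difference in means 5.25); the function name `cochran_cox` is hypothetical, and the code is an illustration rather than a general-purpose routine.

```python
import math

def cochran_cox(mean_diff, var_mean_1, var_mean_2, t1, t2):
    """Return the observed ratio, the weighted critical value t', and
    whether the difference is significant at the level of t1 and t2."""
    # t' is the average of the two tabled t's, weighted by the
    # estimated error variances of the two means.
    t_prime = (var_mean_1 * t1 + var_mean_2 * t2) / (var_mean_1 + var_mean_2)
    ratio = abs(mean_diff) / math.sqrt(var_mean_1 + var_mean_2)
    return ratio, t_prime, ratio > t_prime

ratio, t_prime, significant = cochran_cox(5.25, 1.23, 0.20, t1=3.182, t2=2.571)
```

Because the group with the larger error variance also has the fewer degrees of freedom, the weighted critical value lies much nearer to 3.182 than to 2.571.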


Types of Applications of the Simple-Randomized Design to Experimental Data


The treatments classification in a simple-randomized design may be of 
either of two general types. On the one hand, the “treatments” may repre- 
sent different degrees, or amounts, or intensities, etc., of a single experimental 
variable or factor. For example, the treatments may represent different 
intensities of illumination under which the subject reads, or various durations 
of a certain stimulus, etc. We will call such a classification a “single-factor” 
classification. On the other hand, the treatments may represent complex 


1 Otherwise viewed, when $n_1 = n_2$ the Cochran-Cox test is essentially equivalent to
pairing the measures from the two samples on a random basis, finding the difference 
between the measures in each pair, and applying the simple t-test to the mean of this 
random sample of n differences. 


APPLICATIONS TO EXPERIMENTAL DATA 99 


combinations of a variety of factors or variables many of which may not 
even be identified. For example, the treatments in an educational experiment 
may represent two methods of teaching fourth-grade arithmetic, one of 
which may be described as a “workbook” method, the other as the
“traditional” method, and there may be many respects instead of only one in
which these methods differ. Method $A_1$ may represent greater amounts of
some components than Method $A_2$ and smaller amounts of other components,
so that there may be no clear basis for ranking the treatments in any order. 
A treatment classification in which the treatments are thus complex and 
unordered will be called a “categorical” classification. 

The purpose of an experiment of the single factor type may be simply to 
determine if the experimental factor is one on which the criterion variable 
depends, or if there is any relationship between the experimental and
criterion variables. In this case, whether two or more than two “levels” (amounts,
degrees, intensities, durations, etc.) of the experimental variable should be 
represented in the experiment depends upon how the experimental variable is 
related to the criterion variable, if it is related at all. It is possible that 
the relationship is such that for any given value of the experimental variable 
(Y), the accompanying value of the criterion variable (X) is always equal 
to or larger than that for any lower value of Y. In this case the sequence 
of X values would be described as a “monotonic increasing sequence.” A
monotonic sequence may be either increasing or decreasing. For convenience 
in later discussions, we will say that the relationship between X and Y is 
“monotonic” if there is a monotonic sequence of X values for successive 
values of Y. Now suppose we are planning an experiment in which we are 
quite certain on a priori grounds that the relationship, if any, between X and 
Y is monotonic, and the purpose of the experiment is simply to determine if 
any relationship exists. In that case, the obvious thing to do is to compare 
only two levels of Y in the experiment; the more widely separated these two 
levels, the better. If we were to include a third or additional treatment at 
intermediate levels of Y, we would only lower the power of the test of the 
over-all null hypothesis. However, in many situations we may be uncertain 
that the relationship between X and Y is monotonic. That is, we may recog- 
nize the possibility that X may increase with increases in Y at some levels 
and decrease at others, and that while the value of X may be the same for 
two widely separated values of Y, it may differ for intermediate values. 
In that case, we would wish to represent in our experiment a number of 
levels of Y throughout the range in which we are interested. 

Sometimes we may be quite sure that the relationship between X and Y is 
monotonic. We may wish nevertheless to represent several levels of Y in our 
experiment, since our purpose may be not only to demonstrate that X is a
function of Y but also to describe the nature of the relationship. For example, 
we might wish to determine if the relationship is linear, in which case we 
would have to represent at least three levels of Y. In this case, on the as- 
sumntion that the relationship is monotonic, we would do best to test the 




over-all null hypothesis by means of a t-test of the difference between the 
criterion means for the two extreme levels rather than by means of the over-all 
F-test. The t-test based upon the difference in the means of the extreme levels 
of Y is more likely to reveal the presence of a relationship than is the over-all 
F-test; that is, the t-test is a more powerful test for the presence of a relation- 
ship. The decision to make this test, however, should be made before ex- 
amination of the experimental data, and should not be suggested by the data. 
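
The claim that the t-test on the two extreme levels is the more powerful test when the relationship is monotonic can be illustrated by a sampling experiment. The sketch below rests on assumptions that are not in the text: three levels of Y with ten cases each, population means of 0, 0.5, and 1.0, unit variance, normal measures, and 5% critical values of 3.35 for F (2 and 27 degrees of freedom) and 2.052 for t (27 degrees of freedom) taken from standard tables. The t uses the within-groups mean square from all three groups. All names, the number of trials, and the seed are hypothetical.

```python
import math
import random

def one_experiment(rng, level_means, n):
    """Simulate one experiment; return the over-all F and the t for the
    difference between the two extreme levels."""
    groups = [[rng.gauss(mu, 1.0) for _ in range(n)] for mu in level_means]
    means = [sum(g) / n for g in groups]
    grand = sum(means) / len(means)
    a = len(groups)
    ms_b = n * sum((m - grand) ** 2 for m in means) / (a - 1)
    ss_w = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_w = ss_w / (a * (n - 1))
    t_extreme = (means[-1] - means[0]) / math.sqrt(2.0 * ms_w / n)
    return ms_b / ms_w, t_extreme

def compare_power(level_means=(0.0, 0.5, 1.0), n=10, f_crit=3.35,
                  t_crit=2.052, trials=2000, seed=7):
    """Proportion of simulated experiments in which each test rejects."""
    rng = random.Random(seed)
    f_hits = t_hits = 0
    for _ in range(trials):
        f_ratio, t_extreme = one_experiment(rng, level_means, n)
        f_hits += f_ratio > f_crit
        t_hits += abs(t_extreme) > t_crit
    return f_hits / trials, t_hits / trials

power_f, power_t = compare_power()
```

Under these assumptions the t-test on the extreme levels rejects the false null hypothesis more often than the over-all F-test does. The comparison is legitimate only because the choice of the two extreme levels is fixed in advance, as the text requires.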

Experiments designed not only to detect the presence of a relationship, 
but also to determine the nature of the relationship between the experimental 
and criterion variables constitute a very important class of psychological 
experiments. Such experiments may employ the simple-randomized design 
or any of the more complex designs to be considered later. Because of its 
importance, separate consideration will be given to this class of experiments 
in Chapter 15, after all of the basic designs which might be employed in such 
experiments have been introduced. 


Applications of the Simple-Randomized Design to Observational Data 


In subsequent discussions it will be useful to distinguish between
“experiments” and “investigations,” and between “experimental data” and
“observational data.” An experiment usually involves the administration of
treatments to groups that have been specially constituted by the experimenter 
for the purposes of the experiment, and the analysis of “effects” that have 
been produced or induced in the subjects during the course of the experiment. 
In contrast to this, we will, for present purposes, define an “investigation”
as a study in which observations are made of effects that are already present 
in a real population; an investigation is thus usually to be described as a 
sampling study or as a normative study. “Observational data" are defined 
as data collected in an investigation or a sampling study. This arbitrary
distinction in these terms is not consistent with their general meanings, but if
its purpose is understood, the distinction should cause no confusion, and it 
will result in considerable convenience and economy in reference in later 
discussions. 

Many of the designs to be considered in this text may be applied to “ob- 
servational" data as well as to "experimental" data, and consideration 
will be given to applications of both types. The methods of analysis of 
variance appropriate with the simple-randomized design may be employed 
whenever one wishes to determine if the subpopulations of a given population
differ in the mean value of some criterion variable. Suppose, for example, 
that one wishes to determine if persons of different religious affiliations 
differ in their mean response to a test of “cynicism.”¹ One might then
select a random sample from each “religious affiliation" sub population, ad- 


1 This example was suggested by a study by Neidt and Fritz in Educational and Psy- 
chological Measurement, vol. 10 (Winter, 1950), p. 4. 


STUDY EXERCISES 101 


minister the test to the samples, and apply the method of analysis of vari- 
ance to test the differences among the group means just as with a simple- 
randomized design. In this case, it matters little, so far as the tests of 
significance or the interpretation of results are concerned, what sizes of 
samples are drawn from the subpopulations. One might select the same size 
sample from each subpopulation or draw the entire sample at random from the 
parent population and then divide the subjects into subgroups after the 
initial sampling has been made, or follow still other procedures. 

As an additional example of the application of a simple-randomized design 
to observational data, suppose we wish to determine if the high schools in 
a certain state differ significantly in the quality of students sent to the state 
university. The criterion measure in this study might be the grade point 
averages earned by the students during their first semester at the university. 
The entire freshman class at the university might be subdivided into groups
corresponding to individual high schools (excluding high schools not con- 
tributing more than one student) and the methods of analysis of variance 
applied to these groups just as to the treatment groups in a simple-randomized 
design. The ratio of the mean square for between-schools to that for
within-schools would then provide the basis for the test of the null hypothesis that
the schools are alike in mean quality of student. 
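
The between-schools and within-schools mean squares for groups of unequal size can be computed as in the following sketch. It is an illustration only: the function name `one_way_anova` and the grade point averages are hypothetical, and the resulting ratio would still have to be referred to the F-table for the stated degrees of freedom.

```python
def one_way_anova(groups):
    """Return the ratio of the between-groups to the within-groups mean
    square, with its degrees of freedom, for groups of any sizes."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-groups sum of squares: each group mean weighted by its n.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: deviations from each group's own mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

# Hypothetical first-semester grade point averages for three high schools.
schools = [
    [2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 2.0],
    [3.0, 4.0],
]
f_ratio, df_b, df_w = one_way_anova(schools)
```

A school contributing a single student adds nothing to the within-schools sum of squares, which is why such schools are excluded in the example above.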

The application of the simple-randomized design to observational data 
presents no problems not already considered with reference to experimental 
data. As we shall see later, however, the applications of other basic designs 
to observational data do present problems of their own, concerned primarily 
with the relations among the subgroup numbers, and with bias in sampling.¹
Furthermore, the interpretation of the results is usually more difficult in an 
observational study of effects already present than in a controlled experiment. 
As already implied, this is primarily due to the lack of positive control over 
extraneous factors. In the “religious affiliation" example, for instance, it is 
very difficult, if not impossible, to be sure that the observed differences in 
cynicism are due at all to the religious affiliations, rather than to something 
else which happened to be associated with them. It is also frequently more
difficult to define an effect already present than one imposed by the experi- 
menter. 


STUDY EXERCISES²


1. An experiment carried out by Grice and Saltz³ involved comparisons of
amounts of stimulus generalization in the white rat. Animals were trained to 


1 See Eli S. Marks, “Selective Sampling in Psychological Research," Psychological 
Bulletin, vol. 44, no. 3 (May, 1947), pp. 267-275. 

2 See second paragraph on page viii.

3 G. R. Grice and Eli Saltz, “The Generalization of an Instrumental Response to
Stimuli Varying in the Size Dimension,” Journal of Experimental Psychology, vol. 
40 (December, 1950), pp. 702-708. 




make a simple instrumental response to white circles of a particular size, and 
then separate groups were tested for extent of generalization to circles of 
differing sizes. 

The subjects were 80 experimentally naive albino rats randomly selected 
from the available colony. The ages of the animals ranged from 70 to 110 days 
at the beginning of the experiment. 

The size stimuli employed in this experiment were white circles cut from 
sheet metal with a small, square, hanging door in the center. The door was 
flush with the surface of the disc and could be pushed open by the rat's nose. 
Attached to the back of the disc, just below the door, was the food dish. The 
rat could easily obtain the food by pushing its nose through the door. 

The apparatus is shown in the diagram below: 


[Diagram: the alley, pivoted at its center (O), with the start at one end and
the stimulus circle and food cup at the other.]

(Dotted lines represent vertical sliding doors.)


The alley was mounted on a pivot at the center (O) so that each end could
become in turn the starting end and the reaction end. By rotating the alley 
after each trial, it was unnecessary to handle the animals except at the begin- 
ning and end of each experimental session. The circle was mounted at the 
front of a 2-inch continuation of the alley, which could be slid back to per- 
mit rotation of the alley. 

All of the animals were given two days of preliminary training in learning to 
obtain a food pellet by opening the small door in the center of the stimulus 
circle. Five animals which did not learn to open the door during this pre- 
liminary training period were eliminated. The circle used in the preliminary 
training was the same as the one to be used in the subsequent reinforcement 
training. The purpose of the preliminary training period, then, was simply to 
select rats that could be used in the experiment. 

Following the preliminary training period, the rats were given 20 reinforced 
(rewarded) trials per day for three days on a 79 sq. cm. stimulus circle. As 
soon as the food pellet was obtained, the vertical sliding door in front of the 
stimulus circle was lowered, and the animal was allowed to eat the food before 
the alley was rotated for the next trial. 

After this reinforcement training, the animals were assigned at random to 
five groups of 15 each. Each of these groups was given a series of 25 extinction 
(non-rewarded) trials on a different size circle — i.e., one group was extin- 
guished on a 79 sq. cm. circle, another on a 63 sq. cm. circle, a third on a 50 
sq. cm., the fourth on a 32 sq. cm., and the fifth on a 20 sq. cm. circle. On these 
trials, the door in the circle under test was locked so that it would open only 


1 inch, and there was no food in the food dish. Latencies were recorded as the 
time from raising the center door to contact of the rat's nose with the 
door in the stimulus circle. If the animal failed to respond in 60 seconds, the 
trial was scored as a failure of response. The measure of generalization
employed was the number of responses made during the series of 25 extinction trials.
The criterion measures, together with a part of the computations needed in the 




analysis, are presented below: 


Numbers of Responses in 25 Extinction Trials

                  Area of Test Circle (sq. cm.)

               79      63      50      32      20

               15      16       5       9      T7
               18       8      10       8       7
                9      12      10       8       0
               11       5       7      14       8
               13       9      17       5      11
               20      10      17      11       7
                9      12      11       9       9
               13       8       7      16       9
                5      11       7       7       2
               10      18       6       4       6
               22      12       6       9       8
               18       8       5       8       3
               17      11       7       8       1
               10      10       4       8       0
               12      12       9      10       0

n              15      15      15
ΣX            202     162     128
Mean         13.5    10.8     8.5
ΣX²          3036    1896    1314
(ΣX)²/n      2720    1750    1092


Summary Table

Source                  df      ss      ms

Circle Size

Within-Groups (w)

Total
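
The partial computations printed with the data can be verified before the remaining columns are worked. The sketch below is an illustration only and covers just the three columns (79, 63, and 50 sq. cm.) for which totals are printed. The entries are read as the values that reproduce the printed sums and sums of squares, which is an assumption about the few scores that are not fully legible in the table; the function and variable names are hypothetical.

```python
# Scores for the three columns whose totals are printed with the table.
responses = {
    79: [15, 18, 9, 11, 13, 20, 9, 13, 5, 10, 22, 18, 17, 10, 12],
    63: [16, 8, 12, 5, 9, 10, 12, 8, 11, 18, 12, 8, 11, 10, 12],
    50: [5, 10, 10, 7, 17, 17, 11, 7, 7, 6, 6, 5, 7, 4, 9],
}

def column_summary(scores):
    """Return n, the sum, the mean, the sum of squares, and the squared
    sum divided by n for one treatment group."""
    n = len(scores)
    total = sum(scores)
    return {
        "n": n,
        "sum": total,
        "mean": total / n,
        "sum_sq": sum(x * x for x in scores),
        "sum_sq_over_n": total ** 2 / n,
    }

summaries = {area: column_summary(scores) for area, scores in responses.items()}
```

The same function may be applied to the remaining two columns once their scores have been entered, after which the sums of squares for the summary table follow directly.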




a) Complete the computations and fill in the summary table. 


b) Define carefully the treatment populations concerning which inferences 
may be drawn from this experiment by the logic of statistical inference. 
Distinguish between the “parent” population and the treatment popu- 
lations. 


c) Compute the ratio of the mean square for circle size (the between groups 
mean square) to the within-groups (error) mean square. State the 
hypothesis to be tested by this ratio. What F is required for significance 
at the 1% level? May the hypothesis be rejected?


d) Tabulate the frequency distribution of criterion measures for each of the 
treatment groups. Does an inspection of these distributions suggest a 
sufficiently marked departure from normality to affect the F-distribution 
appreciably? A sufficiently marked heterogeneity of variance? Ex- 
plain. 


e) What effect does the rotating of the alley between trials, as contrasted to 
the alternative of handling the rats between trials, have on the precision 
of the experiment? Why? Would the F-test be equally valid if this 
feature were lacking? 


f) Establish the “critical difference" between two treatment means. Pro- 
vide a table of differences, indicating those that are significant. 


g) The data in the table of differences suggest the hypothesis that the treat- 
ment means tend to decrease as the area of the test circle decreases. Why 
may one not conclude from the F-test of significance in (c) above that this 
hypothesis is true? 


2. In an experiment concerned with the relative effectiveness of certain 
incentives on schoolwork, Hurlock¹ divided the 48 pupils in a class in Grade
IV of a particular school into four random groups of 12 pupils each. Five 
equivalent forms (A, B, C, D, and E) of a 30-item addition test were prepared. 
On the first day of the experiment, Form A was administered to all the pupils 
in the class at one sitting. On the following four days one group (Control) was 
separated from the rest of the class during the experimental period and took a 
different form of the test on each day. The administration of the tests to the 
control group was handled by the regular teacher. The only direction given to 
the control pupils was to “work as usual.” 

The other three groups remained together during the experimental periods
and each day took a different form of the addition test, administered by the
experimenter. Each day before the tests were distributed, the pupils in one
group (Praised) were called to the front of the room and praised for the
excellence of their work on the preceding day, and for their general superiority
over the rest of the class. Following this, the pupils in another group (Reproved)
came forward and were reproved for their poor work on the preceding day and
for their general inferiority. The third group (Ignored) heard the comments
made to the Praised and Reproved groups but received no specific recognition.
The criterion measure was the number of correct answers on Form E of the
addition test administered on the fifth (last) day of the experiment. The
original criterion scores are not provided, but the sums of these scores and of
their squares for each treatment group are presented in the table below.

¹ E. B. Hurlock, “An Evaluation of Certain Incentives Used in Schoolwork,”
Journal of Educational Psychology, vol. 16 (March, 1925), pp. 145-149.


                A1          A2          A3          A4
              Praised    Reproved    Ignored     Control

n_j              12          12          12          12
T_j           195.12      135.12      105.48      109.68
M_j
ΣX²          3503.40     1927.92     1071.60     1178.52
T_j²/n_j


a) Analyze the results and prepare a summary table of the analysis.


b) Suggest some of the bases on which, in an actual situation like this, 
you would decide what risk you are willing to take of rejecting a true 
hypothesis. Suppose you decided on a 5% risk; what is the value of 
F (= ms_A/ms_w) at the corresponding point in the F-distribution? State
explicitly the hypothesis tested by this F-ratio. May the hypothesis be 
retained? Suppose the F had been non-significant. Could this be re- 
garded as evidence that the treatments do not differ? Why? 


c) Define the parent population to which any statistical inferences from this
experiment should be restricted. Suggest some arguments (non-statisti- 
cal) by which the conclusions might be extended to fourth graders in 
general. 


d) Describe the specific operations by which you would go about dividing 
the pupils in an experiment like this into four random groups of equal 
size. 


e) Suggest some specific Type G errors that might have affected the results
of this experiment. What effect do such errors have on the interpreta- 
tion of the F-test as a test of the hypothesis of no treatment effect? Under 
what experimental conditions may one safely infer from a significant F 
that the treatments differ? 


f) How large must the difference between any two treatment means be in
order to be significant at the 5% level? Why would it be inconsistent to
ask, “What is the level of significance of the largest difference?” Do
these results suggest that the over-all F may sometimes be significant
because of the effect of a single treatment?


3. One purpose of an investigation carried out by Baten and Hatcher¹ was
to compare the achievement in home economics classes in different high schools 
in a particular city. A four-week unit in consumer buying was taught to the 
sophomore home economics classes in four selected high schools. The classes 
were taught by the regular teachers who had had in common a short period of 
instruction regarding methods and objectives. At the close of the unit, all the 
classes took an objective test over the material covered. The scores on this 
test served as the criterion measures. The scores were as follows: 


School 

A B C D
39 40 31 35 49 42 39 27 38 43 40 
32 46 43 40 42 34 39 44 45 45 
42 31 32 397 44 40 35 40 35 35 
82 31 38 40 42 27 29 53 41 
40 38 47 27 40 37 43 45 45 
36 32 36 39 40 47 45 39 45 
30 32 44 39 40 39 47 46 


a) Analyze the scores and prepare a summary table of the results. 


b) At the 1% level, what value of F is required for rejection of the null
hypothesis regarding the population means? May we reject this hypothesis
at this level? Describe the populations involved as accurately as you can.


c) An assumption of random sampling underlies the F-test of question (b) 
preceding. What is the population from which the 17 pupils in School A 
were presumably a random sample? 


d) Comment briefly on the assumptions of normality and homogeneity as 
they apply in this situation. Specify carefully the distributions to which 
the assumptions apply and suggest some possible a priori arguments for 
or against the assumptions. In view of the results of the Norton study, 
need one be concerned about the validity of the usual F-test in this situa- 
tion? Explain. 

e) Does the outcome of the test of question (b) permit one to conclude that
home economics is not “equally well taught” in these four schools? Explain,
showing how this situation differs from the usual experimental situation.

¹ William D. Baten and Hazel M. Hatcher, “Testing for Grade and School
Differences Among the Scores of Home Economics Students,” Journal of
Experimental Education, vol. 16 (March, 1948), pp. 176-180.


f) Following the procedure explained on page 95, indicate which of the six 
differences between school means are significant and which are non- 
significant. 


g) The difference between the A and D means is significant. On what
reasoning is this result nevertheless consistent with the hypothesis that the
teaching of home economics is equally effective in these two schools?


Analysis of Variance in Double-Entry Tables 


Introduction 


There are a number of different basic designs in which the criterion 
measures can be presented in a double-entry table. The columns in the table 
may correspond to the different treatments in a certain treatment classifi- 
cation, and the rows may correspond to levels, or to subjects, or to replica- 
tions, or to the categories in a second treatment classification, depending 
on the design involved. In all such designs, the total sum of squares may be 
analyzed into either three or four components, depending on whether only one 
or several measures are contained in each cell of the table. The analysis 
of the total sum of squares is exactly the same no matter what the rows or 
columns may represent. Accordingly, we shall first show in the most general 
terms (rows and columns) how the total sum of squares in a double-entry 
table may be analyzed into its components. The specific meanings of these
components with reference to various basic designs will then be considered
in later chapters.

[Figure: schematic double-entry table of rows and columns]

Notation: Suppose we have a total sample of N measures which is constituted
of c separate groups, each divided into r subgroups, corresponding
subgroup frequencies being in the same proportion for all groups (that is, the
ratio between any two cell frequencies in one column being the same as
the ratio of the corresponding cell frequencies in any other column). These
measures could then be presented in a double-entry table, the columns
corresponding to the groups, and the cells to the subgroups, as represented on
page 108. The notation we will employ is as follows:


X = any measure
r = number of rows
c = number of columns
$n_{ij}$ = number of measures in the ith cell of the jth column. The first
subscript always represents a row, the second a column.

\[ n_{i.} = \sum_{j=1}^{c} n_{ij} = \text{number of measures in row } i \]

\[ n_{.j} = \sum_{i=1}^{r} n_{ij} = \text{number of measures in column } j \]

(The dots are needed to indicate whether the subscript accompanying the
dot represents a row or a column. For example, $n_{.3}$ represents the number of
measures in the third column, while $n_{3.}$ represents the number in the third
row.)

\[ N = \sum_{ij} n_{ij} = \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij} = \text{total number of measures} \]

($\sum_{ij}$ represents a double summation, and may be read “sum for all values
of i and j.”)

\[ T_{ij} = \sum^{n_{ij}} X = \text{sum of measures in cell } ij \]

\[ T_{i.} = \sum^{n_{i.}} X = \text{sum of measures in row } i \]

\[ T_{.j} = \sum^{n_{.j}} X = \text{sum of measures in column } j \]

\[ T = \sum_{ij} \sum^{n_{ij}} X = \sum_{i=1}^{r} \sum_{j=1}^{c} \sum^{n_{ij}} X = \text{sum of all measures} \]

\[ M_{ij} = \frac{1}{n_{ij}} T_{ij} = \text{mean of cell } ij \]

\[ M_{i.} = \frac{1}{n_{i.}} T_{i.} = \text{mean of measures in row } i \]

\[ M_{.j} = \frac{1}{n_{.j}} T_{.j} = \text{mean of column } j \]

\[ M = \frac{1}{N} \sum_{i=1}^{r} \sum_{j=1}^{c} \sum^{n_{ij}} X = \text{general mean} \]
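The notation can be made concrete with a small sketch. The table below is hypothetical (it is not taken from the text), chosen so that the cell frequencies are proportional, and the variable names are only illustrative stand-ins for the symbols defined above.

```python
# Hypothetical 2 x 3 table: table[i][j] is the list of measures in row i, column j.
# Cell frequencies (2, 3, 1) and (4, 6, 2) are proportional, as the notation assumes.
table = [
    [[4, 6], [5, 7, 9], [8]],
    [[3, 5, 7, 9], [2, 4, 6, 8, 10, 12], [6, 8]],
]

r = len(table)       # number of rows
c = len(table[0])    # number of columns

n = [[len(cell) for cell in row] for row in table]   # n_ij
T = [[sum(cell) for cell in row] for row in table]   # T_ij

n_row = [sum(n[i]) for i in range(r)]                        # n_i.
n_col = [sum(n[i][j] for i in range(r)) for j in range(c)]   # n_.j
T_row = [sum(T[i]) for i in range(r)]                        # T_i.
T_col = [sum(T[i][j] for i in range(r)) for j in range(c)]   # T_.j

N = sum(n_row)       # total number of measures
T_all = sum(T_row)   # T, the sum of all measures

M_cell = [[T[i][j] / n[i][j] for j in range(c)] for i in range(r)]   # M_ij
M_row = [T_row[i] / n_row[i] for i in range(r)]                      # M_i.
M_col = [T_col[j] / n_col[j] for j in range(c)]                      # M_.j
M = T_all / N                                                        # general mean

print("n_i. =", n_row, " n_.j =", n_col, " N =", N)
print("T_i. =", T_row, " T_.j =", T_col, " T =", T_all)
```

Storing each cell as a list keeps unequal cell frequencies explicit, so the dot-subscript quantities are simply sums over one index.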




Analysis of Total Sum of Squares into Four Components 


(Method of Arithmetic Corrections) 


To help the student develop the clearest possible understanding of the 
analysis of the total sum of squares, two different proofs will be presented. 
The first involves successive applications of the method of analysis into two 
components, together with arithmetic corrections applied to the individual 
measures. The other is algebraic in character. The arithmetic proof will be 
presented first. 

We have already seen that in any sample consisting of a number of separate
groups, the total sum of squares may be analyzed into two components, a
between-groups and a within-groups component. (See pages 55-56.) There
are three different ways in which we can regard the data in the preceding table
as divided into groups. First, we can disregard the columns and consider the
measures as divided into just r groups, corresponding to rows. Thus viewed,
the total sum of squares can be analyzed into its between-rows and within-rows
components. Second, we can disregard rows and regard the measures
as divided into just c groups, corresponding to columns, in which case the
total sum of squares may be analyzed into its between-columns and within-columns
components. Finally, we can regard the whole sample as consisting
of rc groups corresponding to cells, in which case the total sum of squares may
be analyzed into its between-cells and within-cells components. Thus,

\[ ss_T = ss_R + ss_{wR} \]
\[ ss_T = ss_C + ss_{wC} \tag{23} \]
\[ ss_T = ss_{cells} + ss_{w\,cells} \]


Now suppose that in each row the deviation of the row mean from the 
general mean is subtracted from each individual measure in the row. For 
instance, if the mean of Row 3 is 5 units above the general mean, then 5 will 
be subtracted from every measure in Row 3. Similarly, if the mean of Row 4 
is 2 units below the general mean, then 2 will be added to every measure in 
Row 4. For each row then, the mean of the corrected measures will be the 
same as the general mean of the uncorrected measures. That is, there will 
be no differences among the row means in the table of corrected measures, 
and hence, no between-rows sum of squares in the corrected table. If we let
a prime ($ss'$) on the symbol for a sum of squares indicate that it is based on
the corrected measures, we may then write

\[ ss'_T = ss'_R + ss'_{wR} = ss'_{wR}. \]

But the addition of a constant to every measure within a given row will have
no effect upon the differences among the measures in that row. Accordingly,

\[ ss'_{wR} = ss_{wR} \]

and the total sum of squares in the corrected table is the same as that for
within-rows in the original table. That is,

\[ ss_{wR} = ss'_T. \tag{24} \]


The total sum of squares in the corrected table can now be analyzed into
its between-columns and within-columns components.

\[ ss'_T = ss'_C + ss'_{wC}. \tag{25} \]

From (23), (24) and (25), we may then write

\[ ss_T = ss_R + ss'_C + ss'_{wC}. \tag{26} \]


We may note next that, since the cell frequencies are proportional from column
to column, the corrections applied to rows do not affect the column means.
This point is basic to the analysis and should be thoroughly understood.
Let $a_i$ represent the constant correction to all measures in Row i, in which
case $a_i n_{i.}$ is the sum of the corrections in Row i. Now we know that
$\sum_i a_i n_{i.} = 0$, since the total effect of all corrections in all rows is to leave
the general mean unchanged. We know also that corresponding cell frequencies
are in the same proportion from row to row, that is, we know
$n_{ij}/n_{i.} = k_{.j}$ is a constant for all cells in Column j, and hence, $n_{ij} = n_{i.} k_{.j}$.
If $\sum_i a_i n_{i.} = 0$ for the table as a whole, then $\sum_i a_i n_{i.} k_{.j} = 0$ also. But
$\sum_i a_i n_{i.} k_{.j} = \sum_i a_i n_{ij}$, which is the sum of the corrections for Column j. Since
this sum is equal to zero, the column mean is unchanged by the corrections.
From this it is apparent that

\[ ss'_C = ss_C. \]

Hence, (26) may be written

\[ ss_T = ss_R + ss_C + ss'_{wC}. \]
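The role of proportional cell frequencies can be checked numerically. The following is a minimal sketch on two hypothetical tables (neither is from the text): one with proportional frequencies and one without. The helper names are illustrative assumptions only.

```python
import math

def flatten(cells):
    """Join a list of cells (each a list of measures) into one list."""
    return [x for cell in cells for x in cell]

def mean(xs):
    return sum(xs) / len(xs)

def column_means(table):
    c = len(table[0])
    return [mean(flatten([row[j] for row in table])) for j in range(c)]

def correct_rows(table):
    """Subtract (row mean - general mean) from every measure in each row."""
    general = mean(flatten([cell for row in table for cell in row]))
    corrected = []
    for row in table:
        a_i = mean(flatten(row)) - general   # the constant correction for this row
        corrected.append([[x - a_i for x in cell] for cell in row])
    return corrected

def column_means_unchanged(table):
    before = column_means(table)
    after = column_means(correct_rows(table))
    return all(math.isclose(b, a, abs_tol=1e-9) for b, a in zip(before, after))

# Hypothetical data. Frequencies (2, 3, 1) and (4, 6, 2) are proportional.
proportional = [
    [[4, 6], [5, 7, 9], [8]],
    [[3, 5, 7, 9], [2, 4, 6, 8, 10, 12], [6, 8]],
]
# Frequencies (2, 1) and (1, 3) are not proportional.
disproportionate = [
    [[4, 6], [5]],
    [[3], [2, 4, 6]],
]

print(column_means_unchanged(proportional))      # column means survive the row corrections
print(column_means_unchanged(disproportionate))  # column means are disturbed
```

With disproportionate frequencies the row corrections shift the column means, which is why the argument requires proportionality.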

Now suppose that for each column in the once-corrected table we subtract
the deviation of the column mean from the general mean from each measure
in that column. This would eliminate all differences among column means in
the twice-corrected table. If we let a double prime ($ss''$) indicate that a sum
of squares is based on the twice-corrected measures, we may then write

\[ ss''_C = 0 \]

from which it follows that

\[ ss''_T = ss''_C + ss''_{wC} = ss''_{wC}. \]

But, since a constant correction to all measures in a column does not affect
the differences among those measures,

\[ ss''_{wC} = ss'_{wC}. \]

Hence,

\[ ss''_T = ss'_{wC} \]

and

\[ ss_T = ss_R + ss_C + ss''_T. \]


The total sum of squares in the twice-corrected table may now be analyzed
into its between-cells and within-cells components,

\[ ss''_T = ss''_{cells} + ss''_{w\,cells}, \]

from which it follows that

\[ ss_T = ss_R + ss_C + ss''_{cells} + ss''_{w\,cells}. \tag{27} \]

Since any correction applied to the measures in a cell is the same for all
measures in that cell, the within-cells sum of squares must be the same in
all three tables. That is,

\[ ss_{w\,cells} = ss'_{w\,cells} = ss''_{w\,cells}. \]

If we now let $ss_{RC} = ss''_{cells}$ (read “sum of squares for R by C,” or “... for
rows by columns”), and if we let the sum of squares within-cells ($ss_{w\,cells}$)
be represented more simply by $ss_w$, (27) may be rewritten

\[ ss_T = ss_R + ss_C + ss_{RC} + ss_w. \tag{28} \]


Three of these four components of $ss_T$ may be computed from the original
table and the fourth obtained as a residual. Thus the total sum of squares
in the original table may be analyzed into four components without actually
having to make any arithmetic corrections.

The numbers of degrees of freedom for $ss_T$, $ss_R$, $ss_C$, and $ss_w$, all of which
are computed from the original table, are N - 1, r - 1, c - 1, and N - rc,
respectively. The number of degrees of freedom for cells, however, is not the
same for the twice-corrected as for the original table. In the original table,
this number of degrees of freedom is rc - 1, but one degree of freedom has
been lost for each row but the last in the first correction (see page 52), and
one for each column but the last in the second correction. Hence, the number
of degrees of freedom for cells in the twice-corrected table, that is, for $ss_{RC}$, is

\[ rc - 1 - (r - 1) - (c - 1) = rc - r - c + 1 = (r - 1)(c - 1). \]


The sum of the degrees of freedom for the various components must, of 
course, equal the degrees of freedom for total. The student may check for 
himself to see that this is true here. 
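The method of arithmetic corrections can be carried out literally on a small table. The sketch below uses hypothetical data with proportional cell frequencies; the function names are illustrative and not part of the text. It applies the row correction, then the column correction, and checks that the four components add up to the total and that the degrees of freedom agree.

```python
import math

def flatten(groups):
    return [x for g in groups for x in g]

def mean(xs):
    return sum(xs) / len(xs)

def total_ss(groups):
    xs = flatten(groups)
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

def between_ss(groups):
    m = mean(flatten(groups))
    return sum(len(g) * (mean(g) - m) ** 2 for g in groups)

def within_ss(groups):
    return sum((x - mean(g)) ** 2 for g in groups for x in g)

def rows(table):
    return [flatten(row) for row in table]

def cols(table):
    return [flatten([row[j] for row in table]) for j in range(len(table[0]))]

def cells(table):
    return [cell for row in table for cell in row]

def subtract_deviations(table, by):
    """Subtract (group mean - general mean) from each measure; by is 'row' or 'col'."""
    general = mean(flatten(cells(table)))
    groups = rows(table) if by == "row" else cols(table)
    dev = [mean(g) - general for g in groups]
    return [[[x - (dev[i] if by == "row" else dev[j]) for x in cell]
             for j, cell in enumerate(row)]
            for i, row in enumerate(table)]

# Hypothetical table with proportional frequencies (2, 3, 1) and (4, 6, 2).
table = [
    [[4, 6], [5, 7, 9], [8]],
    [[3, 5, 7, 9], [2, 4, 6, 8, 10, 12], [6, 8]],
]

once = subtract_deviations(table, "row")    # once-corrected table
twice = subtract_deviations(once, "col")    # twice-corrected table

ss_T = total_ss(cells(table))
ss_R = between_ss(rows(table))
ss_C = between_ss(cols(table))
ss_RC = between_ss(cells(twice))   # between-cells SS of the twice-corrected table
ss_w = within_ss(cells(table))

r, c = len(table), len(table[0])
N = len(flatten(cells(table)))
df = {"R": r - 1, "C": c - 1, "RC": (r - 1) * (c - 1), "w": N - r * c}

print(round(ss_T, 4), round(ss_R + ss_C + ss_RC + ss_w, 4))
print(sum(df.values()), N - 1)
```

In practice the corrections are never applied; the sketch only confirms that the residual obtained from the original table equals the between-cells sum of squares of the twice-corrected table.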


Analysis of Total Sum of Squares into Four Components (Algebraic Method) 


We shall now prove again, this time by an algebraic method, that the 
total sum of squares in any double-entry table may be analyzed into four com- 
ponents. Students who are adept at algebra will find it worth while to work 
through this proof. Others can perhaps afford to rely entirely on the arith- 
metic proof of the preceding section. 

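Using the notation on page 109, we may first write the identity

\[ (X - M) = (X - M_{ij}) + (M_{i.} - M) + (M_{.j} - M) + [(M_{ij} - M) - (M_{i.} - M) - (M_{.j} - M)]. \]

We will now let

\[ a = (X - M_{ij}) \]
\[ b = (M_{i.} - M) \]
\[ d = (M_{.j} - M) \]
\[ e = [(M_{ij} - M) - (M_{i.} - M) - (M_{.j} - M)] = (M_{ij} - M_{i.} - M_{.j} + M). \]

With reference to the method of arithmetic corrections, e represents the
deviation of a twice-corrected cell mean from the general mean, the two corrections
$(M_{i.} - M)$ and $(M_{.j} - M)$ having both been subtracted from the deviation
of the original cell mean from the general mean. In this notation, then,

\[ (X - M) = a + b + d + e. \]

Squaring each member of this identity,

\[ (X - M)^2 = a^2 + b^2 + d^2 + e^2 + 2ab + 2ad + 2ae + 2bd + 2be + 2de. \]

Summing over the $n_{ij}$ measures in cell ij (and noting that $\sum^{n_{ij}} a = 0$, so that
the terms in ab, ad, and ae vanish),

\[ \sum^{n_{ij}} (X - M)^2 = \sum^{n_{ij}} a^2 + n_{ij} b^2 + n_{ij} d^2 + n_{ij} e^2 + 2 n_{ij} bd + 2 n_{ij} be + 2 n_{ij} de. \]

Summing for the c cells in Row i,

\[ \sum_{j=1}^{c} \sum^{n_{ij}} (X - M)^2 = \sum_{j=1}^{c} \sum^{n_{ij}} a^2 + n_{i.} b^2 + \sum_{j=1}^{c} n_{ij} d^2 + \sum_{j=1}^{c} n_{ij} e^2 + 2 \sum_{j=1}^{c} n_{ij} bd + 2 \sum_{j=1}^{c} n_{ij} be + 2 \sum_{j=1}^{c} n_{ij} de. \]

Now, letting k = the constant ratio $n_{.j}/n_{ij}$ (the same for every cell in Row i),

\[ 2 \sum_{j=1}^{c} n_{ij} bd = \frac{2b}{k} \sum_{j=1}^{c} n_{.j} d = \frac{2b}{k} \sum_{j=1}^{c} n_{.j} (M_{.j} - M) = 0. \]

Hence,

\[ \sum_{j=1}^{c} \sum^{n_{ij}} (X - M)^2 = \sum_{j=1}^{c} \sum^{n_{ij}} a^2 + n_{i.} b^2 + \sum_{j=1}^{c} n_{ij} d^2 + \sum_{j=1}^{c} n_{ij} e^2 + 2 \sum_{j=1}^{c} n_{ij} be + 2 \sum_{j=1}^{c} n_{ij} de. \]

Summing for the r rows,

\[ \sum_{i=1}^{r} \sum_{j=1}^{c} \sum^{n_{ij}} (X - M)^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \sum^{n_{ij}} a^2 + \sum_{i=1}^{r} n_{i.} b^2 + \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij} d^2 + \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij} e^2 + 2 \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij} be + 2 \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij} de. \]

Now

\[ \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij} d^2 = \sum_{j=1}^{c} d^2 \sum_{i=1}^{r} n_{ij} = \sum_{j=1}^{c} n_{.j} d^2 \]

and

\[ \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij} be = \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij} b [(M_{ij} - M_{i.}) - (M_{.j} - M)] \]
\[ = \sum_{i=1}^{r} b \sum_{j=1}^{c} n_{ij} (M_{ij} - M_{i.}) - \sum_{i=1}^{r} b \sum_{j=1}^{c} n_{ij} (M_{.j} - M) \]
\[ = \sum_{i=1}^{r} b \sum_{j=1}^{c} n_{ij} (M_{ij} - M_{i.}) - 0 \]
\[ = 0, \]

the second sum vanishing for the same reason as the bd term, and the first
because $\sum_j n_{ij} M_{ij} = n_{i.} M_{i.}$.

Similarly,

\[ \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij} de = 0. \]

Hence,

\[ \sum_{i=1}^{r} \sum_{j=1}^{c} \sum^{n_{ij}} (X - M)^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \sum^{n_{ij}} a^2 + \sum_{i=1}^{r} n_{i.} b^2 + \sum_{j=1}^{c} n_{.j} d^2 + \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij} e^2 \tag{29} \]
\[ = \sum_{i=1}^{r} \sum_{j=1}^{c} \sum^{n_{ij}} (X - M_{ij})^2 + \sum_{i=1}^{r} n_{i.} (M_{i.} - M)^2 + \sum_{j=1}^{c} n_{.j} (M_{.j} - M)^2 + \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij} (M_{ij} - M_{i.} - M_{.j} + M)^2 \]

or

\[ ss_T = ss_w + ss_R + ss_C + ss_{RC}. \]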
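The algebraic result can also be checked by computing each component directly from its defining sum. The sketch below is a hedged illustration on hypothetical data; `components` is an illustrative helper, not a function from the text. It also shows that the cross-product terms do not vanish when the cell frequencies are disproportionate.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def components(table):
    """Return (ss_T, ss_w, ss_R, ss_C, ss_RC) from the defining sums of squares."""
    r, c = len(table), len(table[0])
    all_x = [x for row in table for cell in row for x in cell]
    M = mean(all_x)
    n = [[len(cell) for cell in row] for row in table]
    M_cell = [[mean(cell) for cell in row] for row in table]
    M_row = [mean([x for cell in row for x in cell]) for row in table]
    M_col = [mean([x for row in table for x in row[j]]) for j in range(c)]

    ss_T = sum((x - M) ** 2 for x in all_x)
    ss_w = sum((x - M_cell[i][j]) ** 2
               for i in range(r) for j in range(c) for x in table[i][j])
    ss_R = sum(sum(n[i]) * (M_row[i] - M) ** 2 for i in range(r))
    ss_C = sum(sum(n[i][j] for i in range(r)) * (M_col[j] - M) ** 2
               for j in range(c))
    ss_RC = sum(n[i][j] * (M_cell[i][j] - M_row[i] - M_col[j] + M) ** 2
                for i in range(r) for j in range(c))
    return ss_T, ss_w, ss_R, ss_C, ss_RC

# Hypothetical tables: the first has proportional cell frequencies, the second does not.
proportional = [
    [[4, 6], [5, 7, 9], [8]],
    [[3, 5, 7, 9], [2, 4, 6, 8, 10, 12], [6, 8]],
]
disproportionate = [
    [[4, 6], [5]],
    [[3], [2, 4, 6]],
]

for name, tbl in [("proportional", proportional), ("disproportionate", disproportionate)]:
    ss_T, *parts = components(tbl)
    print(name, round(ss_T, 4), round(sum(parts), 4))
```

For the disproportionate table the four components no longer sum to the total, which illustrates why the derivation depends on the constant ratio k.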


The Case of One Observation per Cell 


Let us now consider briefly the special case in which only one observation
has been recorded in each cell, or in which $n_{ij} = 1$. This is frequently the
case in treatments × subjects designs, in which “rows” correspond to “subjects,”
and in which only one criterion measure is obtained for each subject
under each treatment. Whenever there is only one observation per cell,
there can of course be no within-cells sum of squares, and the sum of squares
for cells is the total sum of squares. Hence, the total sum of squares is analyzed
into three components, and the interaction sum of squares ($ss_{RC}$) is
computed as a residual by subtracting the sums of squares for rows and for
columns from that for total.
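As an illustration of the one-observation-per-cell case, the following sketch uses hypothetical scores for 4 subjects under 3 treatments (the data and names are assumptions made only for this example). It obtains the interaction sum of squares as a residual and compares it with the direct definition.

```python
import math

# Hypothetical scores: 4 subjects (rows) under 3 treatments (columns), one measure per cell.
scores = [
    [5, 7, 9],
    [4, 8, 6],
    [6, 9, 12],
    [3, 4, 5],
]

r, c = len(scores), len(scores[0])
N = r * c
T = sum(sum(row) for row in scores)
correction = T ** 2 / N   # T^2 / N

ss_T = sum(x ** 2 for row in scores for x in row) - correction
ss_R = sum(sum(row) ** 2 / c for row in scores) - correction
ss_C = sum(sum(scores[i][j] for i in range(r)) ** 2 / r for j in range(c)) - correction
ss_RC = ss_T - ss_R - ss_C   # interaction obtained as a residual

# Direct definition: squared deviations of twice-corrected measures from the general mean.
M = T / N
M_row = [sum(row) / c for row in scores]
M_col = [sum(scores[i][j] for i in range(r)) / r for j in range(c)]
direct = sum((scores[i][j] - M_row[i] - M_col[j] + M) ** 2
             for i in range(r) for j in range(c))

print(ss_T, ss_R, ss_C, ss_RC, direct)
```

The residual carries (r - 1)(c - 1) degrees of freedom, here 6, and there is no within-cells term.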


Computational Procedure 


We have seen that in any double-entry table the analysis of the total
sum of squares into its four components is accomplished essentially by
successive applications of the method of analysis into two components. Thus
the basic computational formulas already provided, Formulas (8) and (9),
are all that are needed. The sum of squares for between-rows is obtained
by squaring each row total and dividing by the number of cases in the row,
summing these terms and subtracting the square of the grand total divided by
the total number, that is

\[ ss_R = \sum_{i=1}^{r} T_{i.}^2/n_{i.} - T^2/N. \]

The sums of squares for columns and for cells are similarly obtained, and the
sum of squares for interaction is obtained as a residual. These computational
procedures are summarized in the table below. Ordinarily, the results actually
obtained in an analysis are summarized in a similar table. (An example is
given on page 117.)


Source of
Variation            df               ss                                                        ms

Columns (C)          c - 1            $ss_C = \sum_{j=1}^{c} T_{.j}^2/n_{.j} - T^2/N$           $ms_C = ss_C/(c - 1)$

Rows (R)             r - 1            $ss_R = \sum_{i=1}^{r} T_{i.}^2/n_{i.} - T^2/N$           $ms_R = ss_R/(r - 1)$

(Cells)              (rc - 1)         ($ss_{cells} = \sum_{i=1}^{r} \sum_{j=1}^{c} T_{ij}^2/n_{ij} - T^2/N$)

Rows × Columns (RC)  (r - 1)(c - 1)   $ss_{RC} = ss_{cells} - ss_R - ss_C$                      $ms_{RC} = ss_{RC}/(c - 1)(r - 1)$

Within-Cells (w)     N - rc           $ss_w = ss_T - ss_{cells}$                                $ms_w = ss_w/(N - rc)$

Total                N - 1            $ss_T = \sum_{i=1}^{r} \sum_{j=1}^{c} \sum^{n_{ij}} X^2 - T^2/N$


A Worked Example 


The following worked example will make clear the application of the
computational procedures just described.

The example is based on the data in the double-entry table presented below,
with five measures in each cell. The individual measures are presented in
the column along the left margin of each cell. The figure to the right in
each cell represents the cell mean.


               Col 1         Col 2         Col 3         Col 4       Row Means

             34            16            30            38
             19            35            23            32
Row 1        24  27.60     18  19.40     39  29.00     21  28.40       26.10
             36            16            29            36
             25            12            24            15

              7            41            28            48
             24            30            34            25
Row 2        12  25.40     17  28.60     40  34.20     29  30.40       29.65
             43            31            27            22
             41            24            42            28

             39            19            20            30
             25            49            36            23
Row 3        40  40.00     30  34.40     42  32.60     24  30.40       34.35
             57            46            12            38
             39            28            53            37

Col Means        31.00         27.47         31.93         29.73       30.03 (General Mean)


The basic computations are presented on the following page. For instance,
for the first cell in the first row, 34 + 19 + 24 + 36 + 25 = 138. Likewise,
34² + 19² + 24² + 36² + 25² = 4014, and finally, 138²/5 = 3808.80. The entries
for the other cells are similarly computed.

The rest of the example should be self-explanatory.

It would not have been necessary in this case to compute $T_{ij}^2/n_{ij}$ for each
cell individually. In general, when $n_{ij}$ is constant for all cells in a row (or
column), it is simpler to compute $T_{ij}^2$ for each cell in the row (or column), add
these squared sums for all cells in the row (or column) and then divide by the
constant $n_{ij}$ for that row (or column). When $n_{ij}$ is constant throughout the
entire table, as in this example, it would have been simpler to add the squared
cell totals for all cells in the table and then divide this sum by the constant
$n_{ij}$, in this case 5. This could be done on an automatic computing machine by
cumulating the squared cell totals as they are squared and without having to
record any individual $T_{ij}^2$.


                         Col 1        Col 2        Col 3        Col 4        T_i.      T_i.²/n_i.

Row 1   T_ij               138           97          145          142         522      13,624.20
        ΣX²            4014.00      2205.00      4367.00      4430.00
        T_ij²/n_ij     3808.80      1881.80      4205.00      4032.80

Row 2   T_ij               127          143          171          152         593      17,582.45
        ΣX²            4299.00      4407.00      6033.00      5038.00
        T_ij²/n_ij     3225.80      4089.80      5848.20      4620.80

Row 3   T_ij               200          172          163          152         687      23,598.45
        ΣX²            8516.00      6562.00      6413.00      4818.00
        T_ij²/n_ij     8000.00      5916.80      5313.80      4620.80

T_.j                       465          412          479          446     T = 1802
T_.j²/n_.j           14,415.00    11,316.26    15,296.07    13,261.07

$\sum\sum\sum X^2$ = 4014.00 + . . . + 4430.00 + 4299.00 + . . . + 5038.00
                     + 8516.00 + . . . + 4818.00 = 61,102.00

$\sum\sum T_{ij}^2/n_{ij}$ = 3808.80 + . . . + 4032.80 + 3225.80 + . . . + 4620.80
                     + 8000.00 + . . . + 4620.80 = 55,564.40

$\sum_{i} T_{i.}^2/n_{i.}$ = 13,624.20 + 17,582.45 + 23,598.45 = 54,805.10

$\sum_{j} T_{.j}^2/n_{.j}$ = 14,415.00 + 11,316.26 + 15,296.07 + 13,261.07 = 54,288.40

$T^2/N$ = (1802)²/60 = 54,120.07

$ss_T$ = 61,102.00 - 54,120.07 = 6981.93

$ss_R$ = 54,805.10 - 54,120.07 = 685.03

$ss_C$ = 54,288.40 - 54,120.07 = 168.33

$ss_{cells}$ = 55,564.40 - 54,120.07 = 1444.33

$ss_{RC}$ = 1444.33 - 685.03 - 168.33 = 590.97

$ss_w$ = 6981.93 - 1444.33 = 5537.60


                          Summary Table

Source of Variation        df         ss          ms

Columns (C)                 3       168.33       56.11
Rows (R)                    2       685.03      342.51
(Cells)                   (11)    (1444.33)
Rows × Columns (RC)         6       590.97       98.49
Within-Cells (w)           48      5537.60      115.36

Total                      59      6981.93
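The computations of the worked example can be reproduced in a few lines. This is a sketch rather than a prescribed procedure: the variable names are illustrative, and the first measure in Row 2, Col 1 is entered as 7, the value consistent with the printed cell total of 127 and sum of squares of 4299. Mean squares computed from unrounded sums of squares may differ from the printed ones by a unit in the last decimal place.

```python
# Measures from the worked example: cells[i][j] holds the five measures of row i, column j.
cells = [
    [[34, 19, 24, 36, 25], [16, 35, 18, 16, 12], [30, 23, 39, 29, 24], [38, 32, 21, 36, 15]],
    [[7, 24, 12, 43, 41], [41, 30, 17, 31, 24], [28, 34, 40, 27, 42], [48, 25, 29, 22, 28]],
    [[39, 25, 40, 57, 39], [19, 49, 30, 46, 28], [20, 36, 42, 12, 53], [30, 23, 24, 38, 37]],
]
n = 5                                  # measures per cell (constant)
r, c = len(cells), len(cells[0])
N = r * c * n

T_cell = [[sum(cell) for cell in row] for row in cells]               # T_ij
T_row = [sum(row) for row in T_cell]                                  # T_i.
T_col = [sum(T_cell[i][j] for i in range(r)) for j in range(c)]       # T_.j
T = sum(T_row)
correction = T ** 2 / N                                               # T^2 / N

sum_sq = sum(x ** 2 for row in cells for cell in row for x in cell)
ss_T = sum_sq - correction
ss_R = sum(t ** 2 for t in T_row) / (c * n) - correction
ss_C = sum(t ** 2 for t in T_col) / (r * n) - correction
ss_cells = sum(t ** 2 for row in T_cell for t in row) / n - correction
ss_RC = ss_cells - ss_R - ss_C        # interaction as a residual
ss_w = ss_T - ss_cells

summary = [
    ("Columns (C)", c - 1, ss_C),
    ("Rows (R)", r - 1, ss_R),
    ("Rows x Columns (RC)", (r - 1) * (c - 1), ss_RC),
    ("Within-Cells (w)", N - r * c, ss_w),
]
for source, df, ss in summary:
    print(f"{source:<22}{df:>4}{ss:>10.2f}{ss / df:>9.2f}")
print(f"{'Total':<22}{N - 1:>4}{ss_T:>10.2f}")
```

Because the cell frequency is constant, the squared cell totals are summed first and divided by 5 once, as the text recommends.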


The Generalized Meaning of Interaction 


The mean square for interaction ($ms_{RC}$) plays an extremely important part
in the interpretation of results for many basic experimental designs. It will
be well, therefore, before going further, to ascertain as exactly as possible
just what this “interaction” represents in any double-entry table, or what
may be said about it in the most highly generalized terms of rows, columns,
and cells.

We have seen (28) that the interaction sum of squares is the sum of the
weighted squared deviations of the cell means from the general mean in the
twice-corrected table, in which all differences among row means and column
means have been eliminated.

The twice-corrected mean of cell ij is $[M_{ij} - (M_{i.} - M) - (M_{.j} - M)] =
(M_{ij} - M_{i.} - M_{.j} + 2M)$ and the deviation of this twice-corrected mean
from the general mean is $(M_{ij} - M_{i.} - M_{.j} + M)$. The quantity
$(M_{ij} - M_{i.} - M_{.j} + M)$ may be termed the “interaction effect” for cell ij. In the
illustrative exercise (page 116), for example, the interaction effect for the
second cell in the second row is $(M_{22} - M_{2.} - M_{.2} + M)$ = (28.60 - 29.65 -
27.47 + 30.03) = +1.51. The interaction effects for the entire table are given
below. The weighted sum of the squares of these interaction effects is
5(+0.53)² + 5(-4.14)² + . . . + 5(-3.65)² = 590.97, which is the same as the
sum of squares for interaction computed on page 117.


+0.53 | -4.14 | +1.00 | +2.60

-5.22 | +1.51 | +2.65 | +1.05

+4.68 | +2.61 | -3.65 | -3.65
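The interaction effects can be recomputed from the cell totals of the worked example (138, 97, ... , 152, with five measures per cell). The sketch below is illustrative only; because it works from unrounded means, individual effects may differ from the printed values by about 0.01.

```python
# Cell totals T_ij from the worked example; n = 5 measures in every cell.
T_cell = [
    [138, 97, 145, 142],
    [127, 143, 171, 152],
    [200, 172, 163, 152],
]
n = 5
r, c = len(T_cell), len(T_cell[0])

M_cell = [[t / n for t in row] for row in T_cell]                       # M_ij
# With a constant cell frequency, row and column means are simple averages of cell means.
M_row = [sum(row) / c for row in M_cell]                                # M_i.
M_col = [sum(M_cell[i][j] for i in range(r)) / r for j in range(c)]     # M_.j
M = sum(M_row) / r                                                      # general mean

# Interaction effect for cell ij: M_ij - M_i. - M_.j + M
effects = [[M_cell[i][j] - M_row[i] - M_col[j] + M for j in range(c)]
           for i in range(r)]

# Weighted sum of squared interaction effects = interaction sum of squares.
ss_RC = n * sum(e ** 2 for row in effects for e in row)

for row in effects:
    print("  ".join(f"{e:+.2f}" for e in row))
print(f"ss_RC = {ss_RC:.2f}")
```

The effects sum to zero across every row and down every column, which is what the two successive corrections guarantee.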


If the interaction effect for every cell in the table is equal to zero, that is, 
if all the twice-corrected cell means are equal to the general mean, the ob- 
served interaction for the table as a whole will of course be zero. The more 
variable the interaction effects for the individual cells, the greater is the 
interaction mean square, or the greater is the observed interaction for the 
table as a whole. 

If the measures in all cells of the table were simple random samples from 
the same population, that is, if there were no real differences among rows or 
among columns or among cells, we would still expect the twice-corrected cell 
means to differ from one another for no other reason than that of chance fluc- 
tuations in the means of simple random samples. That is, we would never 
expect the observed interaction effect (the interaction mean square) to be zero. 
In any practical application of this analysis, then, one would want to know 
whether or not the observed interaction effect can be entirely accounted for 
in terms of sampling fluctuations alone. In other words, one would want to
test the significance of the observed interaction. Ways of doing this will be
considered later.

While the interaction effect is defined basically in terms of the twice- 
corrected cell means, it may be described also in terms of the relationships 
among the original or uncorrected means. Suppose that any two columns are 
selected from the complete r X c table and that a new table is formed from 
these two columns alone. The total sum of squares for this two-column table 
may then be analyzed into its rows, columns, rows X columns, and within-cells 
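components. The interaction sum of squares for this table is given (29) by

\[ ss_{RC} = \sum_{i} \sum_{j} n_{ij} [(M_{ij} - M_{i.}) - (M_{.j} - M)]^2 \]

which, in a two-column table with constant $n_{ij} = n$, reduces to

\[ ss_{RC} = n \sum_{i=1}^{r} \left\{ [(M_{i1} - M_{i.}) - (M_{.1} - M)]^2 + [(M_{i2} - M_{i.}) - (M_{.2} - M)]^2 \right\} \]
\[ = n \sum_{i=1}^{r} \left\{ \left[ \frac{M_{i1} - M_{i2}}{2} - \frac{M_{.1} - M_{.2}}{2} \right]^2 + \left[ \frac{M_{i2} - M_{i1}}{2} - \frac{M_{.2} - M_{.1}}{2} \right]^2 \right\} \]
\[ = \frac{n}{2} \sum_{i=1}^{r} [(M_{i1} - M_{i2}) - (M_{.1} - M_{.2})]^2 \]
\[ = \frac{n}{2} \sum_{i=1}^{r} (D_{i.} - D)^2 \]

in which $M_{.1}$ and $M_{.2}$ are the means of the two columns, $D_{i.} = (M_{i1} - M_{i2})$,
and $D = M_{.1} - M_{.2}$.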

Thus it is apparent that the mean square for interaction in the two-column 
table depends upon the variability of the differences between cell means for 
the various rows of the table. If all these differences are the same, the inter- 
action mean square will be zero. Even if only one of these differences differs 
from the others there will be an interaction sum of squares. 

In the table on page 116, for example, the difference in the first two cell 
means in Row 1 is 27.60 — 19.40 = 8.20 and the difference in the correspond- 
ing cell means in Row 2 is 25.40 — 28.60 = —3.20. We need not go further 
to say that there is an observed interaction in the table as a whole, since 
if the difference between any two cell means in any row differs from the 
corresponding difference in any other row, the mean square for interaction 
will have a value other than zero. The difference for Row 3 is 40.00 — 34.40 
= 5.60, and the mean of the three differences is 31.00 — 27.47 = 3.53. 

The interaction sum of squares for Columns 1 and 2 of the table is

\[ \frac{n}{2} \sum_{i} (D_{i.} - D)^2 = \frac{5}{2} [(8.20 - 3.53)^2 + (-3.20 - 3.53)^2 + (5.60 - 3.53)^2] = \frac{5}{2} (71.39) = 178.48 \]

and the mean square for interaction for these two columns is 89.24. Similarly
the mean square for interaction for Columns 2 and 3 is 83.65, for Columns 1
and 3 is 164.45, for 1 and 4 is 141.25, for 2 and 4 is 106.05, and for 3 and 4
is 6.40. The mean of these six mean squares is 98.51, which agrees, apart from
rounding of the cell means, with the value of the interaction mean square for
the table as a whole (98.49). It is further
apparent that the mean square for interaction computed for a certain pair of
columns selected from the total table may differ markedly from that computed
for some other pair of columns.

We have just seen that in a table in which $n_{ij}$ is a constant, the mean
square for interaction in the entire table is the simple average of the mean
squares for interaction computed for all possible individual pairs of columns
in that table. The same would be true if $n_{ij}$ were a constant within each row,
even though it differed from row to row.
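The averaging property can be verified on the cell means of the worked example. The sketch below is a hedged illustration (function and variable names are assumptions); it computes the interaction mean square for each pair of columns from the row-wise differences and compares the average of the six values with the mean square for the whole table. Values computed from unrounded column means differ slightly from the printed ones.

```python
import math
from itertools import combinations

# Cell means of the worked example (five measures per cell).
M_cell = [
    [27.60, 19.40, 29.00, 28.40],
    [25.40, 28.60, 34.20, 30.40],
    [40.00, 34.40, 32.60, 30.40],
]
n = 5
r, c = len(M_cell), len(M_cell[0])

def pair_interaction_ms(j, k):
    """Interaction mean square for the two-column table formed from columns j and k."""
    D = [row[j] - row[k] for row in M_cell]       # D_i. for each row
    D_bar = sum(D) / r                            # D, the difference of the column means
    ss = n / 2 * sum((d - D_bar) ** 2 for d in D)
    return ss / (r - 1)                           # df = (r - 1)(2 - 1)

pair_ms = {(j + 1, k + 1): pair_interaction_ms(j, k)
           for j, k in combinations(range(c), 2)}
mean_pair_ms = sum(pair_ms.values()) / len(pair_ms)

# Interaction mean square for the whole table, from the interaction effects.
M_row = [sum(row) / c for row in M_cell]
M_col = [sum(M_cell[i][j] for i in range(r)) / r for j in range(c)]
M = sum(M_row) / r
ss_RC = n * sum((M_cell[i][j] - M_row[i] - M_col[j] + M) ** 2
                for i in range(r) for j in range(c))
ms_RC = ss_RC / ((r - 1) * (c - 1))

for pair, ms in pair_ms.items():
    print(pair, round(ms, 2))
print(round(mean_pair_ms, 2), round(ms_RC, 2))
```

The agreement is exact in unrounded arithmetic, since each interaction effect enters c - 1 of the pairwise comparisons with equal weight.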


Treatments × Levels Designs


Generalized Case of the Treatments × Levels Designs


The basic features of the treatments X levels design have already been con- 
sidered (pages 13-16) and the design illustrated in a simple restricted case. 
(The student should review these features carefully before proceeding with 
this chapter.) The major purpose of the design is to increase the precision 
of the treatment comparisons by “matching” the treatment groups with 
reference to a “control” variable related to the criterion variable. In the 
generalized case of this design, involving a number (a) of treatments, all 
available subjects (presumably either a random or a representative sample 
from some specified population) are divided into / different groups or “levels,” 
the numbers in these groups being in the same proportion as the numbers in 
the corresponding levels in the entire population. This division may be based 
either on a continuous control variable (Y) or on the categories (ordered. or 
non-ordered) in any classification applicable to the members of the population. 
In either case, in accordance with the purposes of the design, it is assumed 
that the levels differ appreciably in the mean value of the criterion variable 
(X), or that there is a substantial correlation between the control and cri- 
terion variables. Within each level, the subjects are assigned at random to 
the a subgroups corresponding to the treatments. Ordinarily, the same num- 
ber of subjects would be assigned to each treatment subgroup within each 
level. However, we will consider here the more highly generalized case in 
which the subgroup numbers vary within each level and from level to level, 
but in which the numbers for corresponding subgroups are in the same proportion
for all levels. The criterion measures for the subjects may then be tabulated
in a double-entry table of l rows and a columns, the rows corresponding
to levels, the columns to treatments, and the cells to subgroups within the
various levels. In the manner described on pages 110 to 114 the total sum of
squares ($ss_T$) for this table may then be analyzed into four components:
treatments ($ss_A$), levels ($ss_L$), treatments X levels ($ss_{AL}$), and
within-subgroups ($ss_w$), unless there is only one observation per cell, in
which case $ss_w$ does not appear.



The major purpose of an experiment of this type is usually to determine 
if the treatments would have different average effects on the members of the 
specified population. For this purpose, we would wish to test the null hypothe- 
sis that the population mean is the same for all treatments. Each such 
mean, of course, represents the average effect of the treatment for individuals 
in all “levels” in the control classification. This hypothesis does not imply 
that the relative effects of different treatments are the same for all levels; 
one treatment might be much more effective than another at one level but 
less effective or even inferior to the same treatment at another level. 

A second purpose of the experiment may be to determine whether or not the 
treatments do have the same relative effects at all levels, that is, to determine 
whether or not there is any “interaction” of treatments and levels. The 
corresponding hypothesis to be tested is that the differences among correspond- 
ing treatment population means are the same for all levels. If either or both 
of these general hypotheses must be rejected, we may become interested in 
various subordinate or more specific hypotheses concerned with individual 
pairs of treatments or with individual levels. 

For the purposes of subsequent discussions, it will be convenient to define 
the “effect” of a treatment somewhat differently from the way it was de- 
fined in Chapter 3. In Chapter 3 we were concerned with the effect of a 
given treatment on a single subject. We shall now be concerned with the 
effect of a treatment on a group of subjects, or on the corresponding “treatment
group.” In the simple-randomized design, we will now regard the
“effect” of a single treatment as corresponding to the deviation of the treatment
group mean from the general mean ($M_{.j} - M$). If all treatment group
means are the same, the observed "effect" of each treatment will be zero. 
We will also need to refer to the collective effects of all treatments. We will 
call this the “treatments” effect (note the plural) and will regard it as cor- 
responding to the mean square for treatments. Again, if all observed treat- 
ment means are the same, the observed treatments effect will be zero. 

The treatment effects in a treatments X levels design are similarly defined,
except that we must distinguish between the effects of a treatment at a
given level and its average effect for all levels. We will call the effect of
a given treatment at a given level the “simple” effect of the treatment.
The average effect of the treatment at all levels will be called the “main”
effect of the treatment. A simple effect of a given treatment is associated with
the deviation of the corresponding treatment subgroup mean from the mean
for its level. For example, the simple effect of Treatment $A_2$ at the third level
is associated with ($M_{32} - M_{3.}$). In the double-entry table, a simple effect
is associated with the deviation of the corresponding cell mean from its
row mean. The main effect of a treatment is associated with the deviation
of the treatment mean from the general mean, that is, with the deviation of
the column mean from the general mean ($M_{.j} - M$). The main effect of a
treatment is thus the weighted average of all its simple effects. In accordance
with the preceding definitions, the main effect of treatments is associated with
the mean square for treatments for the table as a whole, while a simple effect
of treatments is associated with the mean square for treatments computed
only for the data from the given level.
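
The relation between simple and main effects can be checked numerically. The sketch below uses a hypothetical table of cell means and cell sizes (the numbers are illustrative only and do not come from the text) in which the frequencies are proportional from level to level, as the generalized design assumes. Under that assumption it confirms that the main effect of each treatment equals the average of its simple effects, weighted by the cell sizes.

```python
# Hypothetical cell means and cell sizes: rows are levels, columns are treatments.
# The sizes are proportional from row to row (1:2 within every level).
means = [[10.0, 14.0],
         [12.0, 13.0],
         [15.0, 21.0]]
sizes = [[2, 4],
         [3, 6],
         [1, 2]]

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def simple_and_main_effects(means, sizes):
    levels, treatments = len(means), len(means[0])
    row_means = [weighted_mean(means[i], sizes[i]) for i in range(levels)]
    columns = [[means[i][j] for i in range(levels)] for j in range(treatments)]
    col_sizes = [[sizes[i][j] for i in range(levels)] for j in range(treatments)]
    col_means = [weighted_mean(columns[j], col_sizes[j]) for j in range(treatments)]
    general = weighted_mean([m for row in means for m in row],
                            [n for row in sizes for n in row])
    # Simple effect: cell mean minus the mean of its level (row).
    simple = [[means[i][j] - row_means[i] for j in range(treatments)]
              for i in range(levels)]
    # Main effect: treatment (column) mean minus the general mean.
    main = [col_means[j] - general for j in range(treatments)]
    return simple, main

simple, main = simple_and_main_effects(means, sizes)
for j in range(len(main)):
    averaged = weighted_mean([row[j] for row in simple], [row[j] for row in sizes])
    print(f"treatment {j + 1}: main effect {main[j]:.4f}, "
          f"weighted mean of simple effects {averaged:.4f}")
```

If the cell sizes were not proportional, the two printed values would in general differ, which is one reason the proportionality requirement matters.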


The Analysis of the Total Sum of Squares 


The analysis of the total sum of squares in the treatments X levels design 
is exactly like that already described in general terms for all double-entry 
tables (pages 110—114), except that in our notation we will let A, L, a, and l 
take the place of C, R, c, and r respectively. The analysis may be summarized 
in a table like the following: 


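Source of Variation        | df             | ss                                                                              | ms
Treatments (A)             | a - 1          | $ss_A = \sum_{j=1}^{a} T_{.j}^2 / n_{.j} - T^2/N$                                | $ss_A/(a - 1)$
Levels (L)                 | l - 1          | $ss_L = \sum_{i=1}^{l} T_{i.}^2 / n_{i.} - T^2/N$                                | $ss_L/(l - 1)$
(Cells)                    | (al - 1)       | $ss_{cells} = \sum_{i=1}^{l} \sum_{j=1}^{a} T_{ij}^2 / n_{ij} - T^2/N$           |
Treatments X Levels (AL)   | (a - 1)(l - 1) | $ss_{AL} = ss_{cells} - ss_A - ss_L$                                             | $ss_{AL}/(a - 1)(l - 1)$
Within-Subgroups (w)       | N - al         | $ss_w = ss_T - ss_{cells}$                                                       | $ss_w/(N - al)$
Total                      | N - 1          | $ss_T = \sum_{i=1}^{l} \sum_{j=1}^{a} \sum^{n_{ij}} X^2 - T^2/N$                 |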
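
The computations summarized in the table can be sketched in a few lines of code. The function below is a minimal illustration, not a prescribed procedure: the name `analyze`, the nested-list layout `cells[i][j]`, and the small data set are all assumptions made for the example. It presumes proportional cell frequencies, since only then may the interaction sum of squares be obtained by subtraction as in the table.

```python
def analyze(cells):
    """cells[i][j] is the list of criterion scores for level i and treatment j."""
    l, a = len(cells), len(cells[0])
    scores = [x for row in cells for cell in row for x in cell]
    N = len(scores)
    correction = sum(scores) ** 2 / N                      # T^2 / N
    columns = [[x for i in range(l) for x in cells[i][j]] for j in range(a)]
    rows = [[x for cell in cells[i] for x in cell] for i in range(l)]

    ss = {"T": sum(x * x for x in scores) - correction}
    ss["cells"] = sum(sum(c) ** 2 / len(c) for row in cells for c in row) - correction
    ss["A"] = sum(sum(c) ** 2 / len(c) for c in columns) - correction
    ss["L"] = sum(sum(r) ** 2 / len(r) for r in rows) - correction
    ss["AL"] = ss["cells"] - ss["A"] - ss["L"]
    ss["w"] = ss["T"] - ss["cells"]

    df = {"A": a - 1, "L": l - 1, "AL": (a - 1) * (l - 1), "w": N - a * l, "T": N - 1}
    # A mean square is defined only where there are degrees of freedom.
    ms = {k: ss[k] / df[k] for k in ("A", "L", "AL", "w") if df[k] > 0}
    return ss, df, ms

# Hypothetical data: 2 levels, 3 treatments, 2 subjects per subgroup.
cells = [[[3, 5], [6, 8], [9, 9]],
         [[1, 3], [5, 5], [4, 8]]]
ss, df, ms = analyze(cells)
for source in ("A", "L", "AL", "w"):
    print(source, df[source], round(ss[source], 4), round(ms[source], 4))
print("T", df["T"], round(ss["T"], 4))
```

With a single observation per cell the within-subgroups entry has no degrees of freedom, and the function simply omits its mean square.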


The Meaning of Interaction 


Before considering the other features of this design, it will be well to explore
more thoroughly the meaning of “interaction.” Let us first consider
its meaning when there are just two treatments and two levels, as in the
design diagrammed below.


              $A_1$       $A_2$
   $L_1$    $M_{11}$    $M_{12}$
   $L_2$    $M_{21}$    $M_{22}$




In this case, the observed interaction for the experimental sample is measured
by the difference between the differences between treatment means
for the two levels. That is, it is measured by $d = (M_{11} - M_{12}) - (M_{21} - M_{22})$.
We know that each of the subgroup means is subject to random sampling
fluctuations; hence the observed interaction might be due entirely to such
fluctuations. Before drawing any inferences about the population, therefore,
we would wish to test the significance of the observed interaction. This could
be done by the t-test (see page 15)


$$t = \frac{d}{\text{est'd } \sigma_d}.$$


(The more highly generalized F-test of the significance of the interaction in 
the multiple-treatment, multiple-level case which we shall consider later is 
essentially the equivalent of this t-test in the two-treatment, two-level case.) 
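
A small numerical sketch may make this test concrete. The formula on page 15 is not reproduced here, so the code assumes the usual estimate of the standard error of d from the pooled within-subgroups mean square, namely the square root of $ms_w \sum 1/n_{ij}$; that form, the function name `interaction_t`, and the data are assumptions made for illustration.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def interaction_t(cells):
    """cells[i][j]: scores for level i (0 or 1) under treatment j (0 or 1)."""
    m = [[mean(c) for c in row] for row in cells]
    # Difference between the treatment differences at the two levels.
    d = (m[0][0] - m[0][1]) - (m[1][0] - m[1][1])
    flat = [c for row in cells for c in row]
    ss_w = sum(sum((x - mean(c)) ** 2 for x in c) for c in flat)
    df_w = sum(len(c) for c in flat) - 4
    ms_w = ss_w / df_w
    # Assumed form of the estimated standard error of d (pooled variance).
    se_d = sqrt(ms_w * sum(1 / len(c) for c in flat))
    return d, d / se_d, df_w

# Hypothetical scores, 3 subjects per subgroup.
cells = [[[12, 14, 16], [9, 10, 11]],
         [[8, 10, 12], [9, 11, 13]]]
d, t, df_w = interaction_t(cells)
print(d, round(t, 4), df_w)
```

With equal subgroup sizes, the square of this t equals the ratio of the interaction mean square to the within-subgroups mean square, which is the sense in which the F-test generalizes it.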

If, according to this t-test, the observed interaction is too large to be due 
entirely to chance, that is, if it is “significant,” we may conclude that there 
is an interaction in the population — granting that the population is subject to 
exactly the same influences as those affecting the experimental sample. That is, 
we may conclude that $(\mu_{11} - \mu_{12}) \neq (\mu_{21} - \mu_{22})$, where the $\mu$'s represent
population means.

Intrinsic Versus Extrinsic Interaction: We may not conclude from a significant
t, however, that the larger-than-chance interaction is necessarily due
to differences in the relative effects of the treatments at the two levels. Part
or all of the observed difference in subgroup means at the upper level
($M_{11} - M_{12}$) may be due, not to a difference in treatments, but to some extraneous
factors associated with the treatments at that level in the experiment,
and similarly for the lower level. That is, the observed d may be due in whole
or in part to Type G errors. A significant t only tells us that the observed d
is too large to be reasonably accounted for entirely in terms of random Type S
fluctuations. Only on the assumption that Type G errors have been completely
eliminated in the experiment (or equalized for the two levels), may we
conclude from a significant t that the difference in treatment effects is not the
same at both levels in the population.

In any experiment of this type, there are three possible components of the 
observed treatments X levels interaction, measured in this case by d (and in 
the general case by the interaction mean square). One component is due to 
Type S errors, one to Type G errors, and one to the treatments alone. That 
part due to treatments alone we will give the name “intrinsic” interaction; 
that due to Type G errors we will call “extrinsic” interaction. If the ob- 
served interaction is significant, we may conclude that something more 
than Type S errors is present, or that there is an interaction in the popu- 
lation. The fact of significance, however, does not enable us to say whether 
the interaction is intrinsic or extrinsic, or a mixture of both. In nearly all 
experiments of this type, a significant interaction is a mixture of intrinsic 
and extrinsic interaction, but there is no way of determining from the experi- 
ment alone what proportion is intrinsic and what extrinsic. 




Two Treatments — Several Levels: Let us now consider the case in which 
there are several levels but still only two treatments, as diagrammed below.


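              $A_1$       $A_2$       Difference
   $L_1$    $M_{11}$    $M_{12}$    $D_{1.} = M_{11} - M_{12}$
   ...
   $L_i$    $M_{i1}$    $M_{i2}$    $D_{i.} = M_{i1} - M_{i2}$
   ...
   $L_l$    $M_{l1}$    $M_{l2}$    $D_{l.} = M_{l1} - M_{l2}$
            $M_{.1}$    $M_{.2}$    $D = M_{.1} - M_{.2}$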


The observed interaction now depends on the differences among the differences
between the treatment means for all levels; it is measured by the
variability of these differences. If $D_{i.} = M_{i1} - M_{i2}$ represents the difference
at level i, then the observed interaction depends on the variability of these
D's or, more strictly, on a variance estimate derived from these differences.
If the number of cases in all of the treatment subgroups is the same (n), the
observed interaction is measured by

$$\frac{n}{2} \sum_{i=1}^{l} (D_{i.} - D)^2 \Big/ (l - 1),$$

which is the mean square for interaction, D representing the mean of the
$D_{i.}$'s (see page 119). Again it is possible that the observed interaction is due
entirely to chance fluctuations in the subgroup means. As in the simpler case,
therefore, we will wish to test the significance of the observed interaction
before drawing any inference about the population. We shall see later how
this may be done.
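
The connection between the variability of the level differences and the interaction mean square can be written out in a few lines. The derivation below is a sketch under the stated assumption of equal subgroup size n, with D denoting the mean of the $D_{i.}$'s; it uses only the double-entry definitions of row, column, and general means.

```latex
% Two treatments (j = 1, 2), l levels, n cases per subgroup, so that
%   M_{i.} = (M_{i1} + M_{i2})/2   and   M = (M_{.1} + M_{.2})/2.
\begin{align*}
M_{i1} - M_{i.} - M_{.1} + M
  &= \tfrac{1}{2}(M_{i1} - M_{i2}) - \tfrac{1}{2}(M_{.1} - M_{.2})
   = \tfrac{1}{2}(D_{i.} - D), \\
M_{i2} - M_{i.} - M_{.2} + M
  &= -\tfrac{1}{2}(D_{i.} - D), \\
ss_{AL} &= n \sum_{i=1}^{l} \sum_{j=1}^{2} (M_{ij} - M_{i.} - M_{.j} + M)^2
         = \frac{n}{2} \sum_{i=1}^{l} (D_{i.} - D)^2, \\
ms_{AL} &= \frac{ss_{AL}}{(2 - 1)(l - 1)}
         = \frac{n}{2} \sum_{i=1}^{l} (D_{i.} - D)^2 \Big/ (l - 1).
\end{align*}
```

The factor n/2 arises because each level contributes two interaction effects of equal size and opposite sign, each half as large as the deviation of its difference from the mean difference.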

The Multiple-Treatment, Multiple-Level Case: If there are more than two
treatments as well as more than two levels, the observed interaction for the
table as a whole may be regarded as a weighted average of the observed interactions
for all possible pairs of treatments (columns) in the entire table.
For example, if there are three treatments, we could compute the mean square
for interaction from the data for Treatments $A_1$ and $A_2$ alone, and likewise
for Treatments $A_2$ and $A_3$ alone, and $A_1$ and $A_3$ alone. If all subgroups are of
the same size, the mean square for interaction for the entire table would be
the simple mean of these three mean squares. The observed interaction for
the entire table may also be regarded as depending on the variability of the
interaction effects for the individual treatment subgroups (cells), that is, on
the deviations of the twice-corrected cell means from the general mean. For
many purposes, however, it is more meaningful to think of interaction as
depending upon the variability of the differences between treatment means
at the various levels for individual pairs of treatments.
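
The statement that, with equal subgroup sizes, the interaction mean square for the whole table is the simple mean of the pairwise interaction mean squares can be verified on a small example. The cell means and the subgroup size below are hypothetical, and `interaction_ms` is an illustrative helper that computes the mean square from the twice-corrected cell means.

```python
from itertools import combinations

def interaction_ms(means, n):
    """Interaction mean square from a table of cell means, n cases per cell."""
    l, a = len(means), len(means[0])
    row = [sum(r) / a for r in means]
    col = [sum(means[i][j] for i in range(l)) / l for j in range(a)]
    general = sum(row) / l
    # Twice-corrected cell means: remove the row and column deviations.
    ss = n * sum((means[i][j] - row[i] - col[j] + general) ** 2
                 for i in range(l) for j in range(a))
    return ss / ((a - 1) * (l - 1))

# Hypothetical cell means: 3 levels (rows) by 3 treatments (columns), n = 4.
means = [[10.0, 12.0, 15.0],
         [11.0, 15.0, 14.0],
         [13.0, 12.0, 19.0]]
n = 4

whole_table = interaction_ms(means, n)
# Recompute the interaction mean square for each pair of columns alone.
pairwise = [interaction_ms([[r[j], r[k]] for r in means], n)
            for j, k in combinations(range(3), 2)]
print(round(whole_table, 4), [round(p, 4) for p in pairwise],
      round(sum(pairwise) / len(pairwise), 4))
```

The three pairwise values differ considerably from one another in this example even though their mean reproduces the value for the whole table.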

It may be helpful to regard the case of no interaction as that in which 
the effects of the treatments (and any associated extraneous factors) are 
“additive.” Suppose that before a certain treatment is administered, the
population mean of the criterion variable for each level is ascertained. This
initial mean will of course differ from level to level, but within each level 
all subpopulation means will be the same. If the average effect of the treat- 
ment at each of these levels is the same, that is, if the effect is equivalent 
to having added a constant to all initial level means, we may say that the 
treatment effect is “additive.” If all treatments are additive in their effects, 
there will be no interaction in the table. That is, the effect of one treatment 
will be to add one constant to all initial level means for that treatment, 
and the effect of each other treatment will be to add another constant to 
all initial level means for that treatment. This will create differences in the 
final criterion means for any level, but the corresponding differences will 
be the same for all levels; hence there will be no interaction. 

It may also be helpful to define an "interaction effect" as the difference
between a simple effect and the corresponding main effect. The interaction
effect for Treatment $A_2$ at the third level is thus the difference between the
simple effect of $A_2$ at this level and the main effect of $A_2$, that is,
$(M_{32} - M_{3.}) - (M_{.2} - M) = (M_{32} - M_{3.} - M_{.2} + M)$. If the simple effect of each treatment
is the same at all levels, all interaction effects will be zero and there
will be no observed interaction in the table. To say that there is an inter- 
action, then, is to say that the simple effects of all treatments are not the 
same at all levels. 
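
The link between additivity and the absence of interaction effects can be illustrated directly. In the sketch below the initial level means and the treatment constants are hypothetical, equal subgroup sizes are assumed, and `interaction_effects` is an illustrative helper that returns the difference between each simple effect and the corresponding main effect.

```python
def interaction_effects(means):
    """Cell mean minus row mean minus column mean plus general mean (equal cell sizes)."""
    l, a = len(means), len(means[0])
    row = [sum(r) / a for r in means]
    col = [sum(means[i][j] for i in range(l)) / l for j in range(a)]
    general = sum(row) / l
    return [[means[i][j] - row[i] - col[j] + general for j in range(a)]
            for i in range(l)]

initial_level_means = [20.0, 30.0, 45.0]   # hypothetical means before treatment
treatment_constants = [0.0, 4.0, 7.0]      # hypothetical additive effects

# Each treatment adds its own constant to every initial level mean.
additive = [[m + c for c in treatment_constants] for m in initial_level_means]
print(interaction_effects(additive))

# Let the second treatment do better at the third level only: additivity is lost.
non_additive = [row[:] for row in additive]
non_additive[2][1] += 6.0
print(interaction_effects(non_additive))
```

In the additive table every interaction effect vanishes apart from rounding error; changing a single cell produces nonzero interaction effects throughout the table, not only in the altered cell.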

It may be well to note here that in all our discussions of interaction we 
should distinguish carefully between the "observed" interaction (which 
characterizes the experimental data) and the interaction which characterizes 
the population. We should also distinguish carefully between intrinsic and 
extrinsic interaction. Generally, when on the basis of the experimental re- 
sults we say without qualification that there “is an interaction," we imply 
that the observed interaction has been found to be significant, and our state- 
ment is an inference about the population. If we say without qualification 
that there is “no interaction," we imply that the observed interaction has been 
shown to be nonsignificant, and our statement is an assumption or hypothesis
about the population. Whenever we mean observed interaction, we shall say so.

Heterogeneity of Interaction: When we say that there “is an interaction”
in the table as a whole, we usually mean that somewhere in the entire table 
the difference between corresponding (observed) treatment means differs from 
level to level by more than is reasonably attributable to chance. This state- 
ment does not imply, however, that there are “larger-than-chance” differences 
everywhere in the table. 

If there are three treatments, it is possible that Treatments $A_1$ and $A_2$
are both additive in their effects, but that the effect of Treatment $A_3$ is not
additive. In that case, there would be no interaction with levels so far as
$A_1$ and $A_2$ are concerned, but there would be an interaction so far as $A_1$ and
$A_3$, or $A_2$ and $A_3$, are concerned. Again, all three treatments may be non-additive
in their effects, but Treatments $A_1$ and $A_2$ may be much alike in their
effects at all levels, while the relative effectiveness of $A_3$ may differ markedly
from level to level. In that case, we would say that there is a “weak” or
“slight” interaction so far as $A_1$ and $A_2$ are concerned, but a strong interaction
so far as $A_1$ and $A_3$, or $A_2$ and $A_3$, are concerned. That is, the variance of
the differences between treatment means for the various levels may be small
for $A_1$ and $A_2$ but large for $A_1$ and $A_3$ and for $A_2$ and $A_3$. In that case, we
would say that the interaction for the table as a whole is “heterogeneous,”
or that the interaction is stronger for some pairs of treatments than for others.
On the other hand, if the variance of the differences between treatment means 
for the various levels is the same for all possible pairs of treatments, we would 
say that the interaction for all treatments and levels considered together is 
homogeneous. Another way of describing heterogeneity of interaction is to 
say that the simple effects of some treatments are more variable than those 
for other treatments, or that the variance of the interaction effects is greater 
for some columns (treatments) than for others. This concept of heterogeneity 
or homogeneity of interaction, as we shall see, is a very important concept 
in the interpretation of results obtained in many experimental designs. 

The Uniformity Trial: It might be helpful, in distinguishing between in- 
trinsic and extrinsic interaction, to consider what might happen if a particular 
experiment were repeated with a fresh but equivalent sample of subjects 
under exactly the same experimental conditions, with the exception that in 
the repetition of the experiment, the same treatment (any one of the experi- 
mental treatments) was administered to all treatment groups. This is some- 
times described as a “uniformity trial.” In the uniformity trial, of course, 
there could not possibly be any intrinsic interaction, and whatever inter- 
action is observed would have to be due entirely to sampling fluctuations 
and experimental errors. If a much larger observed interaction were obtained 
in the actual experiment than in the uniformity trial, this would suggest the 
presence of an intrinsic interaction in the experiment. A test of the significance
of the difference between the two observed interactions would then be
a test of the hypothesis that there is no intrinsic interaction in the original
experiment. This test could be made by $F = ms_{AL}/ms'_{AL}$, in which $ms'_{AL}$ is
the mean square for interaction obtained in the uniformity trial. This
would be a one-tailed test like the test of $ms_A/ms_w$.

In a single experiment of the treatments X levels type, it is not possible 
to separate the intrinsic and extrinsic interactions, but it is possible to esti- 
mate what part of the observed interaction is due to sampling fluctuations 
(Type S error) only, and it is possible to determine whether or not the ob- 
served interaction can reasonably be attributed to sampling fluctuations 
alone. The test of significance appropriate to the purpose will be considered 
later (pages 138-139). 


Constituting the "Levels" in an Experiment 


In any application of the treatments X levels design, what inferences may 
legitimately be drawn from the experiment will depend in part on the manner
of constituting the levels in the experimental sample. This division of subjects
into levels may be accomplished in at least two different ways.

Representative Sampling from a Real Population: In some situations, the 
distribution of the control variable for the entire population may be known. 
In that case, it may be possible to draw a strictly representative sample from 
this population. A representative sample is here defined as one in which the 
numbers of subjects in the various levels or categories in the sample are 
exactly proportional to the numbers in the corresponding levels in the entire 
population, and in which the sampling is strictly random within each level 
independently. If the levels are based on a continuous control variable, the 
distribution of that variable may be broken up into as many intervals or levels 
as the experimenter desires. If the reason for introducing the control vari- 
able is solely to increase the precision of the experiment, the larger the num- 
ber of levels the better. However, the number of levels will be limited by 
the requirement that there be at least two subjects in each cell of the double- 
entry table, and by the total number of subjects to be employed. The 
intervals need not be of the same size. 

Suppose, for example, that in a learning experiment involving three treat- 
ments, the control variable is the score on a general intelligence test. The 
distribution of scores on this test for a very large sample taken from the 
entire population (the sample on which the test was standardized) is known 
to be that given in Column 2 in Table 6. Since the sample is so large, the dis- 
tribution in Column 2 may, for practical purposes, be regarded as the distri- 
bution for the entire population. Suppose the distribution of intelligence 
test scores for the available subjects is that given in Column 3 of the table, 
and it has been decided to employ approximately 100 subjects in the experi- 
ment. We might then divide the scale of intelligence test scores into a number 
of relatively coarse intervals, as indicated by the lines across the columns. 
These intervals are determined on a “cut and try” basis, in an effort to have 
the percent of individuals in the whole population that is contained in each 
interval as close as possible to a multiple of 3 (not less than 6). The popu- 
lation distribution for these intervals is given in Column 4. The number 
in parentheses following each frequency expresses this frequency as a percent 
of the total population. It will be noted that most of these percents are 
quite close to a multiple of 3. The distribution in these intervals for the 
available subjects is given in Column 5. For the purpose of the experiment, 
6 subjects are selected at random from the 25 subjects available in the first
interval, 9 are selected at random from those in the second interval, 18 
at random from those in the third, etc. Accordingly, by comparing Col- 
umns 4 and 6, we note that the frequencies in the various intervals in the 
experimental sample are very nearly proportional to the corresponding per- 
cents in the entire population. The experimental sample may, in other 
words, be regarded as a representative sample so far as the distribution 
of intelligence test scores is concerned. 

TABLE 6

Distribution of Intelligence Test Scores for an Experimental Sample
and for the Population Represented by the Sample

[Columns (1) to (3): Intelligence Test Scores, Distribution for the Population,
and Distribution for Available Subjects. Entries not recoverable.]

Interval | (4) Distribution for Population | (5) Distribution for Available Subjects | (6) Distribution for Selected Sample
1        | 495 (4.8%)                      | 25                                      | 6
2        | 928 (9.1%)                      | 20                                      | 9
3        | 1731 (16.9%)                    | 31                                      | 18
4        | 1525 (14.9%)                    | …                                       | …
5        | 2393 (23.4%)                    | …                                       | …
6        | 2115 (20.7%)                    | …                                       | …
7        | 1029 (10.1%)                    | 52                                      | 9


This method of sampling is the only one strictly satisfying the requirements
of the hypothesis to be tested, but for obvious practical reasons it
can rarely be used. We will refer to this method of sampling as the method
of representative sampling from a specified real population.
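
The allocation described in this illustration can be sketched as a small computation. The percents are those given for the coarse intervals in Column 4 of Table 6; the rule used here (the nearest multiple of the number of treatments, never fewer than 6, for a sample of roughly 100) is one reading of the “cut and try” procedure, and the function names and subject labels are hypothetical. The text confirms only the counts for the first three intervals and the last.

```python
import random

# Percent of the population in each coarse interval (Table 6, Column 4).
population_percents = [4.8, 9.1, 16.9, 14.9, 23.4, 20.7, 10.1]

def allocate(percents, treatments=3, minimum=6):
    """Nearest multiple of `treatments` to each percent, never below `minimum`."""
    return [max(minimum, treatments * round(p / treatments)) for p in percents]

def select(available, count, rng):
    """Draw `count` subjects at random, without replacement, from one interval."""
    return rng.sample(available, count)

counts = allocate(population_percents)
print(counts, sum(counts))

rng = random.Random(1953)                      # fixed seed only for repeatability
first_interval = [f"S{k}" for k in range(25)]  # 25 available subjects, labels hypothetical
chosen = select(first_interval, counts[0], rng)
print(chosen)
```

Because every count is a multiple of 3, the subjects selected in each interval can afterwards be divided at random into three treatment subgroups of equal size.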

The “Counting Off” Method: A more common situation is that subjects from 
some real population are available for experimental purposes, but it is not 
known that the available subjects are a strictly random sample from that 
population. In this situation, also, the distribution of the control variable
for the population may be unknown. In this case, we can regard the available
subjects as a representative sample from a hypothetical population, one defined
to fit the sample (see pages 73-74). The available subjects are first arranged
in order of the control variable (arranging in random order all individuals
having the same value of the control variable). The levels are then constituted
by counting off na subjects at a time from the top of the distribution
(n subjects for each of the a treatments), n being larger than 1 but otherwise
selected at the experimenter’s convenience. Thus, if there are 3 treatments,
the first 6 subjects (those with the 6
highest scores) might be those in the first level, the next 6 those in the second 
level, the third 6 those in the third, etc. All subjects can thus be used except 
the number less than 3 that may be left over after the last complete set of 6 
or 9 has been selected. Suppose, for example, that 47 subjects are available 
for the experiment and that their control measures are as indicated in the 
list below: 


25 21 36 27 16 47
34 35 33 51 24 18
15 M 26 20 33 23
46 15 55 10 Xx 16
33 21 22 N 38 29
29 19 41 40 34 32


Two subjects are eliminated at random in order to leave a multiple of 3. 
The control measures of those eliminated are crossed out in the list above. 
The remaining subjects are then arranged in order of their control scores
and divided into groups of 6, as indicated below. Note that there are 9 
subjects in the last level. 


55 36 21 19 
51 35 27 18 
47 34 27 WW 
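
The counting-off procedure can be sketched as follows. The function names, the seed, and the 47 control scores are hypothetical (the scores are generated at random rather than copied from the list above); the steps follow the description in the text: drop at random the subjects in excess of a multiple of the number of treatments, order the rest with ties in random order, count off na at a time, and add any remainder to the last level.

```python
import random

def count_off(scores, treatments, n, rng):
    """Split control scores into levels of n * treatments subjects, highest first.

    Subjects beyond the largest multiple of `treatments` are dropped at random;
    what remains after the last complete level is added to that level.
    """
    pool = list(scores)
    rng.shuffle(pool)                       # random order settles ties and the drop
    pool = pool[len(pool) % treatments:]    # discard the random extras
    pool.sort(reverse=True)                 # stable sort: ties keep their shuffled order
    size = n * treatments
    if size > len(pool):
        raise ValueError("not enough subjects for a single level")
    complete = len(pool) // size
    levels = [pool[k * size:(k + 1) * size] for k in range(complete)]
    levels[-1].extend(pool[complete * size:])
    return levels

def assign(level, treatments, rng):
    """Randomly deal the subjects of one level into equal treatment subgroups."""
    shuffled = list(level)
    rng.shuffle(shuffled)
    return [shuffled[j::treatments] for j in range(treatments)]

rng = random.Random(0)                               # fixed seed only for repeatability
scores = [rng.randint(10, 55) for _ in range(47)]    # 47 hypothetical control scores
levels = count_off(scores, treatments=3, n=2, rng=rng)
print([len(level) for level in levels])
print(assign(levels[0], 3, rng))
```

With 47 subjects, 3 treatments, and n = 2, two subjects are dropped and the 45 that remain form six levels of 6 and a last level of 9, as in the illustration.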


The experimental sample may now fairly be regarded as a representative
sample from a hypothetical population showing the same relative distribution
of the control variable as the sample itself. For the experimental subjects in
the illustration used, the distribution of control scores is as given in Column 2 
of the table below, the intervals having been defined in terms of the sample 
drawn. The hypothetical population from which this sample may be regarded 
as a representative sample is one in which the proportions of individuals in 
these intervals are as indicated in Column 3 of the table below. 


(1)                      (2)           (3)                  (4)
                                       Proportions in       Proportions
Score Interval           Frequency     Hypothetical Pop.    in Real Pop.
1) 43.0 and above        6             .1333                .1210
2) 36.5 - 43.0           6             .1333                .1433
3) 32.87 - 36.5          6             .1333                .1064
4) 27.5 - 32.87          6             .1333                .1810
5) 24.5 - 27.5           6             .1333                .0963
6) 19.0 - 24.5           6             .1333                .1487
7) 19.0 and below        9             .2000                .2033


When the levels are constituted in this manner, the distribution of the 
control variable for the hypothetical population will, of course, differ from 
that for the real population in which the experimenter is interested. Certain 
levels will be more heavily represented in the hypothetical population than in 
the real one, and others less heavily. For instance, with reference to the 
illustration, the distribution of control measures for the population in
which the experimenter is really interested may be as indicated in Column 4 
of the preceding table. Thus we see that the hypothetical population contains
a larger proportion of individuals than the real population in intervals 3 and 
5 but a smaller proportion than the real population in interval 4. If there 
is an interaction of treatments and levels, and if a given treatment happens 
to be relatively effective in intervals 3 and 5, but relatively ineffective in 
interval 4, that treatment may, on the average, be more effective in the 
hypothetical population than in the real population. Thus the treatment 
(column) means obtained in the experiment may be biased estimates of the 
means of the real population to which the experimenter wishes to generalize, 
but unbiased with reference to the hypothetical population. 
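
The possible bias can be made concrete with a short computation. The proportions below are those of Columns 3 and 4 of the preceding table; the level means for the treatment are hypothetical, chosen so that the treatment is relatively effective in intervals 3 and 5 and relatively ineffective in interval 4, and `population_mean` is an illustrative helper.

```python
# Proportions of the seven score intervals (Columns 3 and 4 of the table above).
hypothetical = [.1333, .1333, .1333, .1333, .1333, .1333, .2000]
real = [.1210, .1433, .1064, .1810, .0963, .1487, .2033]

# Hypothetical mean criterion scores of ONE treatment at the seven levels,
# chosen so the treatment does well in intervals 3 and 5 and poorly in interval 4.
level_means = [60.0, 55.0, 58.0, 40.0, 52.0, 38.0, 30.0]

def population_mean(level_means, proportions):
    """Average of the level means weighted by the proportion in each level."""
    total = sum(proportions)        # guards against rounding in the proportions
    return sum(m * p for m, p in zip(level_means, proportions)) / total

mean_hyp = population_mean(level_means, hypothetical)
mean_real = population_mean(level_means, real)
print(round(mean_hyp, 2), round(mean_real, 2), round(mean_hyp - mean_real, 2))
```

The treatment mean is higher for the hypothetical population than for the real one, so a column mean from the experiment would overestimate the treatment's average effect in the real population. Were the level means to differ by the same constant from treatment to treatment (no interaction), the same shift would apply to every treatment and the comparisons among treatments would be unaffected.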

So far as the hypothetical population is concerned, the F-test of the treat- 
ment effect (to be considered later) will be valid (if other necessary conditions 
are met); in the strictest sense, any statistical inferences drawn from this 
test should be restricted to the hypothetical population. Before extending 
any of these inferences to a real population, the experimenter should give 
very careful consideration to possible differences between the real and hypo- 
thetical populations. In general, the populations will differ with respect to 
many specific factors (such as intelligence, age, environmental background, 
heredity, etc.). If a particular factor in which they do differ does not “interact”
with the experimental factor, then that difference, in itself, will offer
no obstacle to generalization. For example, if the members of the hypo- 
thetical population are on the average much taller than those of the real 
population, and if height does not “interact” with "treatments" — that 
is, if the differences among the treatment effects are the same for tall as 
for short individuals —then the fact that the populations differ in height 
is of no consequence so far as generalization is concerned. The real ques- 
tion, then, is not “Do the populations differ?”, but “Do they differ with 
respect to anything which interacts with the experimental factor?" Fre- 
quently, the best that the experimenter can do is to offer a reasoned opinion 
concerning the answer to this question, but at least it should be a carefully 
reasoned opinion, making use of all that is known about the problem. 

If the experiment is to be controlled on the basis of non-ordered categories 
in a discrete classification, those categories must of course be accepted by 
the experimenter as they are found. If the maximum number of subjects 
is to be employed from among those available, the number selected at random 
from each category will be the largest possible multiple of the number of 
treatments. If the number of subjects available in a certain category is 
less than the number of treatments, this category may be combined with 
another, and the largest possible multiple of a subjects may be drawn at 
random from the combined categories. 


Selection of the Control Variable 


In most instances the principal reason, and often the sole reason, for 
using the treatments X levels design in preference to the simple-randomized 
design is to increase the precision of the experiment. The total sum of squares 
for the entire sample used in the experiment will of course be the same 
whether or not that sample is divided into levels. If the simple-randomized 
design is employed, the "error" sum of squares will be the within-treatments 
sum of squares, obtained by subtracting the sum of squares for treatments 
from that for total ($ss_{error} = ss_{wA} = ss_T - ss_A$). If the treatments X levels
design is used, the “error” sum of squares will be the within-cells sum of
squares, obtained by subtracting the sums of squares for treatments, levels,
and treatments X levels from the sum of squares for total
($ss_{error} = ss_T - ss_A - ss_L - ss_{AL}$). The within-cells sum of squares may also be computed
by subtracting the levels and treatments X levels sums of squares from the
sum of squares for within-treatments ($ss_{error} = ss_{wA} - ss_L - ss_{AL}$). It is
obvious, then, that the error term will often be very much smaller in the
treatments X levels design than in the simple-randomized design; how much
smaller will depend on the magnitude of the differences among levels ($ss_L$)
and on the degree of interaction ($ss_{AL}$). The magnitude of the differences
among the level means depends on the correlation between the control and
the criterion variable. The higher this correlation, the larger will be the
levels sum of squares, and the smaller will be the error term. Clearly, therefore,
if a number of possible control variables are available for an experiment,
that one should be used which shows the highest correlation with the criterion. 
So far as precision alone is concerned, the choice should always be based 
strictly on empirical considerations, or only on the known correlation with 
the criterion, and on the convenience or economy with which the control 
may be employed, even though the apparent “logic” of the situation may 
suggest another control variable. Sometimes more than one control variable 
may be used, and the best linear composite of these variables (determined by 
methods of multiple regression) may be employed as a basis for constituting 
the levels. Sometimes, also, an "initial" measure of the criterion variable 
itself may show a higher correlation with the “final” measures (the criterion 
measures) than any other available control variable, and may therefore be 
used as the control variable. 
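
The gain in precision from introducing levels can be seen by computing the two error terms for the same scores. The data below are hypothetical (2 levels, 3 treatments, 2 subjects per subgroup), and `ss_within` is an illustrative helper that pools the squared deviations of scores from their own group means.

```python
def ss_within(groups):
    """Pooled sum of squared deviations of each score from its own group mean."""
    total = 0.0
    for g in groups:
        m = sum(g) / len(g)
        total += sum((x - m) ** 2 for x in g)
    return total

# Hypothetical scores: cells[i][j] holds level i, treatment j.
cells = [[[3, 5], [6, 8], [9, 9]],
         [[1, 3], [5, 5], [4, 8]]]
N, a, l = 12, 3, 2

by_treatment = [[x for row in cells for x in row[j]] for j in range(a)]
by_cell = [c for row in cells for c in row]

ss_error_simple = ss_within(by_treatment)   # within-treatments, levels ignored
ss_error_levels = ss_within(by_cell)        # within-cells, levels taken out

print("simple-randomized:  ss =", ss_error_simple, " ms =", ss_error_simple / (N - a))
print("treatments X levels: ss =", ss_error_levels, " ms =", ss_error_levels / (N - a * l))
```

The difference between the two sums of squares is the part removed as levels and treatments X levels. The error degrees of freedom also fall, from N - a to N - al, so the error mean square shrinks only when the control variable accounts for enough of the variation to offset that loss.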


Testing the Significance of the Main Effect of Treatments 


We are now ready to consider the test of significance of the main effect 
of treatments. This is a test of the hypothesis that the various treatments 
would have the same effect on the mean of the criterion scores for the popula- 
tion from which the experimental sample was drawn. This test is based on 
the ratio ($ms_A/ms_w$) of the mean squares for treatments and within-cells.
Under certain conditions, which are listed below, this ratio is distributed as F.
The last of the conditions listed constitutes the hypothesis to be tested; the
first three constitute the assumptions on which the test of this hypothesis is
valid. The conditions¹ for an F-distribution of $ms_A/ms_w$ are as follows:


1) Each treatment group was originally a representative sample from a 
specified population. (That is, each treatment subgroup was origi- 
nally drawn at random from the corresponding level of the given popu- 
lation, the number being drawn at each level being proportional to 
the number of individuals at that level in the whole population. It 
follows from this, of course, that the cell frequencies in the double- 
entry table are proportional from column to column or from row to 
row.) After the administration of the treatments, the treatment groups 
must be regarded as representative samples from different (hypothetical) 
populations. The distribution of criterion measures for each of 
these hypothetical populations is that which would have been obtained 
from the original population had each member received a given treatment. 


2) The distribution of criterion measures for the subpopulation correspond- 
ing to each treatment subgroup is a normal distribution. 


1 See footnote on page 51. 


3) Each of these distributions has the same variance ($\sigma^2$). 


4) The means of the hypothetical populations corresponding to the various 
treatments are identical (the null hypothesis). 


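To prove that $ms_A/ms_w$ is distributed as F, we will first write the ratio 
between the mean square for treatments (columns) and the mean square for 
within-subgroups, and divide both numerator and denominator by $\sigma^2$. We 
then get 

$$
\frac{ms_A}{ms_w}
= \frac{\sum_{j=1}^{a} n_{.j}(M_{.j}-M)^2 \big/ (a-1)}
       {\sum_{i=1}^{l}\sum_{j=1}^{a}\sum^{n_{ij}} (X-M_{ij})^2 \big/ (N-al)}
= \frac{\left[\sum_{j=1}^{a} n_{.j}(M_{.j}-M)^2/\sigma^2\right] \big/ (a-1)}
       {\left[\sum_{i=1}^{l}\sum_{j=1}^{a}\sum^{n_{ij}} (X-M_{ij})^2/\sigma^2\right] \big/ (N-al)}
\tag{30}
$$
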

Now the mean of any treatment (column) is equal to the weighted average of 
the means of the subgroups (cells) in that column. That is, 

$$
M_{.j} = \frac{n_{1j}M_{1j} + n_{2j}M_{2j} + \cdots + n_{ij}M_{ij} + \cdots + n_{lj}M_{lj}}{n_{.j}}
       = \frac{n_{1j}}{n_{.j}}M_{1j} + \frac{n_{2j}}{n_{.j}}M_{2j} + \cdots + \frac{n_{lj}}{n_{.j}}M_{lj}
$$


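Hence, under Conditions 1 and 3, since the variance of the sum of a number 
of independent variables is equal to the sum of their variances, 

$$
\sigma^2_{M_{.j}}
= \frac{n_{1j}^2}{n_{.j}^2}\cdot\frac{\sigma^2}{n_{1j}}
+ \frac{n_{2j}^2}{n_{.j}^2}\cdot\frac{\sigma^2}{n_{2j}}
+ \cdots
+ \frac{n_{lj}^2}{n_{.j}^2}\cdot\frac{\sigma^2}{n_{lj}}
= \frac{\sigma^2}{n_{.j}^2}\sum_{i=1}^{l} n_{ij}
= \frac{\sigma^2}{n_{.j}}
$$
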


Accordingly, under Condition 2, 

$$
\frac{n_{.j}(M_{.j}-\mu)^2}{\sigma^2} = \frac{(M_{.j}-\mu)^2}{\sigma^2/n_{.j}} = \frac{(M_{.j}-\mu)^2}{\sigma^2_{M_{.j}}}
$$

is distributed as $\chi^2$ with one degree of freedom. From this it follows that 
$\sum_{j=1}^{a} n_{.j}(M_{.j}-\mu)^2/\sigma^2$ is distributed as $\chi^2$ with $a$ degrees of freedom. If we now 
take deviations of the $M_{.j}$'s from the sample mean ($M$) rather than from the 
population mean ($\mu$), the result is still distributed as $\chi^2$ (see page 30) but with 
one less degree of freedom. That is, 

$$
\frac{\sum_{j=1}^{a} n_{.j}(M_{.j}-M)^2}{\sigma^2}
$$

is distributed as $\chi^2$ with $(a - 1)$ degrees of freedom. 


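Also, under Conditions 1, 2, and 3, for a single subgroup, 

$$
\frac{\sum^{n_{ij}} (X-M_{ij})^2}{\sigma^2}
$$

is distributed as $\chi^2$ with $(n_{ij} - 1)$ df; hence 

$$
\frac{\sum_{i=1}^{l}\sum_{j=1}^{a}\sum^{n_{ij}} (X-M_{ij})^2}{\sigma^2}
$$

is distributed as $\chi^2$ with $\sum_{i=1}^{l}\sum_{j=1}^{a}(n_{ij} - 1) = (N - al)$ df. 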


Remembering Conditions 1 and 2, we may next note that for any one cell, 
$n_{ij}M_{ij}$ is independent of $\sum^{n_{ij}}(X - M_{ij})^2$ computed for that cell only, since for 
a random sample from a normal population the mean is independent of the 
variance. Accordingly, the sums of these terms for all cells in any one column 
must also be independent of one another. That is, $\sum_{i=1}^{l} n_{ij}M_{ij} = n_{.j}M_{.j}$ is independent 
of $\sum_{i=1}^{l}\sum^{n_{ij}}(X - M_{ij})^2$. Therefore, $n_{.j}(M_{.j} - M)^2$ must also be independent 
of $\sum_{i=1}^{l}\sum^{n_{ij}}(X - M_{ij})^2$, and the sums of these terms for all columns, 
$\sum_{j=1}^{a} n_{.j}(M_{.j} - M)^2$ and $\sum_{j=1}^{a}\sum_{i=1}^{l}\sum^{n_{ij}}(X - M_{ij})^2$, must also be independent. 

Thus we see that the right-hand term in (30) is the ratio between two 
independent $\chi^2$'s, each divided by its own degrees of freedom. Accordingly, 
under Condition 4, we may write 

$$
F = \frac{\sum_{j=1}^{a} n_{.j}(M_{.j}-M)^2 \big/ (a-1)}
         {\sum_{i=1}^{l}\sum_{j=1}^{a}\sum^{n_{ij}} (X-M_{ij})^2 \big/ (N-al)}
  = \frac{ms_A}{ms_w}, \qquad df = (a-1)/(N-al),
\tag{31}
$$



and may employ this ratio between the mean squares for treatments and for 
within-cells to test the given hypothesis. 
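
The ratio of the treatments mean square to the within-cells mean square can be computed directly from the double-entry table. The following is a minimal sketch, not a definitive implementation: the function name `main_effect_f`, the nested-list layout `cells[i][j]` (scores for level i under treatment j), and the small data set are all hypothetical, and proportional cell frequencies are assumed.

```python
def main_effect_f(cells):
    """Return (F, (df1, df2)) for the main effect of treatments.

    cells[i][j] is the list of criterion scores for level i, treatment j.
    Assumes every cell is non-empty and cell frequencies are proportional.
    """
    l, a = len(cells), len(cells[0])
    scores = [x for row in cells for cell in row for x in cell]
    N = len(scores)
    M = sum(scores) / N  # general mean

    # Treatments sum of squares: sum over columns of n.j * (M.j - M)^2
    ss_A = 0.0
    for j in range(a):
        column = [x for i in range(l) for x in cells[i][j]]
        ss_A += len(column) * (sum(column) / len(column) - M) ** 2

    # Within-cells sum of squares: deviations of scores from their own cell mean
    ss_w = 0.0
    for row in cells:
        for cell in row:
            m = sum(cell) / len(cell)
            ss_w += sum((x - m) ** 2 for x in cell)

    ms_A = ss_A / (a - 1)
    ms_w = ss_w / (N - a * l)
    return ms_A / ms_w, (a - 1, N - a * l)

# Hypothetical 2 levels x 2 treatments table, two subjects per cell
cells = [[[1, 3], [5, 7]],
         [[2, 4], [8, 10]]]
F, df = main_effect_f(cells)
print(F, df)  # 25.0 (1, 4)
```

The obtained F would then be referred to a table of F for the degrees of freedom returned.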

Interpretation of a Significant $F = ms_A/ms_w$: If, in a particular treatments 
X levels experiment, we find that $F = ms_A/ms_w$ is significant, we may be quite 
certain that at least one of the conditions listed on pages 133-134 was not met. 
In most applications, we may safely conclude that the condition not satisfied 
is the last, that is, we may reject the hypothesis that the treatments have 
the same average effects on the members of the specified population. However, 
it is important to understand the conditions under which a significant F 
may be due to a failure to satisfy one or more of the assumptions basic to 
this test (Conditions 1 to 3). We will therefore give careful consideration 
here to each of the three assumptions (or conditions) separately. 

The extent to which Condition 1 (representative sampling) is satisfied 
lies wholly within the control of the experimenter. There is usually no good 
reason why this condition should not be completely satisfied, if not with 
reference to a real population, at least with reference to a hypothetical one 
defined to fit the sample. If the “counting off" method of constituting the 
levels has been employed, and the sample is regarded as representative of a 
hypothetical population, Condition 1 may be regarded as fully satisfied if 
the subjects have been randomized with reference to treatments in each level 
independently. To avoid any question of unintentional bias, it is usually 
best to insure randomization by using a table of random numbers in assigning 
subjects to treatments within each level. 
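
Randomizing subjects to treatments within each level independently might be sketched as follows. A pseudo-random shuffle stands in here for the table of random numbers; the function name `assign_within_levels`, the subject labels, and the seed are hypothetical, and equal numbers per treatment within each level are assumed so that the cell frequencies remain proportional.

```python
import random
from collections import Counter

def assign_within_levels(levels, treatments, seed=None):
    """Randomly assign subjects to treatments, separately within each level.

    levels: dict mapping a level label to its list of subject ids.
    Returns a dict mapping subject id -> (level, treatment).
    """
    rng = random.Random(seed)
    assignment = {}
    for level, subjects in levels.items():
        if len(subjects) % len(treatments) != 0:
            raise ValueError("level size must be a multiple of the number of treatments")
        shuffled = list(subjects)
        rng.shuffle(shuffled)  # independent randomization for this level
        per_cell = len(shuffled) // len(treatments)
        for k, treatment in enumerate(treatments):
            for subject in shuffled[k * per_cell:(k + 1) * per_cell]:
                assignment[subject] = (level, treatment)
    return assignment

# Hypothetical example: two levels of four subjects each, two treatments
levels = {"high": ["s1", "s2", "s3", "s4"], "low": ["s5", "s6", "s7", "s8"]}
plan = assign_within_levels(levels, ["A1", "A2"], seed=1)
counts = Counter(plan.values())
print(counts)
```

Each (level, treatment) cell receives the same number of subjects, so the frequencies are proportional from level to level.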

When the sample is thus regarded as representative of a hypothetical popu- 
lation, any statistical inferences drawn from the sample should be restricted 
to this hypothetical population. However, if the counting off method has 
been employed, and if the total experimental sample may be regarded as a 
simple random sample (or the equivalent) from a real population, it is usually 
reasonable to suppose that this hypothetical population will not differ appre- 
ciably from the real population. 

Proportionality of the frequencies of the corresponding treatment sub- 
groups from level to level is important not only for the reasons just given, 
but also because the validity of the analysis of the total sum of squares into 
its components depends on this condition. There is a possibility that the 
experiment may originally have been properly designed in this regard, but 
that some subjects were “lost” during the course of the experiment, so that 
at its conclusion, the frequencies are no longer proportional. What to do in 
this event will be considered later (page 148). 

The use in educational research of samples consisting of intact school 
classes raises the same problems with the treatments X levels design as with 
the simple-randomized design (see pages 74-75). 

It should be noted that the requirement of normality is in general considerably 
more likely to be satisfied in applications of the treatments X levels 
design than of the simple-randomized design. In the treatments X levels 
design, it is no longer necessary to assume that the criterion variable (X) 
is normally distributed for the population sampled, but only that X is nor- 
mally distributed for those members of that population who are alike or 
nearly alike with reference to Y, the control variable. What we assume, in 
effect, is that the errors of estimating X from Y are normally distributed for 
the population, rather than that X is normally distributed for the population. 
It is readily conceivable, particularly if the correlation between X and Y 
is high, that the errors of estimate are normally distributed even though the 
distribution of X itself is markedly skewed, or otherwise not normal. 

A similar observation may be made with reference to the assumption of 
homogeneity of variance. There is always a possibility of a strong interaction 
between treatments and levels, in which case the differences among the subgroup 
(level) means for one treatment may be very much smaller than those 
for another; nevertheless, the variance within subgroups may remain essen- 
tially the same from treatment to treatment. If this were the case, the as- 
sumption of homogeneous variance would be invalid if the simple-randomized 
design were employed, but valid if the treatments X levels design were em- 
ployed. 

In any event, a considerable departure, either from normality in the cri- 
terion distribution, or from homogeneity of variance among various cells of 
the table, is permissible; yet the sampling distribution of the ratio of mean 
squares for treatments and within-cells will remain essentially the same 
(see page 78 ff.). 

It is very important to observe that the test of the main effect of treat- 
ments takes no cognizance of Type G or Type R errors. If any uncontrolled 
extraneous factors have been associated with the experimental treatments, 
these factors, so far as this test is concerned, are regarded as a part of the 
treatments themselves. Furthermore, the inferences drawn from the sample 
must be restricted to the population (real or hypothetical) from which the 
entire experimental sample may be regarded as a representative sample. 
For instance, if a learning experiment is conducted in a certain school, the 
available subjects may be regarded as a representative sample from a “population” 
corresponding to that particular school only. The effectiveness of 
the treatments for this particular school may not be the same as for other 
schools. Hence, if the observed treatment effect is to be regarded as an 
estimate of the treatment effect for a population consisting of a large number 
of such schools, a Type R error is present which is not taken into considera- 
tion in this test. 

In most applications of the treatments X levels design, then, if we find 
a significant $F = ms_A/ms_w$, we may safely conclude that a failure to satisfy 
Conditions 1 to 3 could hardly be responsible, and that the hypothesis con- 
tained in Condition 4 (equal treatment means) must be false. Nevertheless, 
Conditions 1 to 3 should never be taken for granted. In every actual ex- 
periment, each should be carefully reviewed. 

It should be emphasized that the test of the hypothesis of equal treat- 
ment means does not involve any assumption about the presence or absence 
of an interaction between treatments and levels. Whether or not there is 
any such interaction, the F-test of the main effect of treatments is valid so 
far as Type S errors are concerned. If there is no interaction, the fact that 
the total experimental sample is neither a strictly random nor a strictly repre- 
sentative sample from the real population is of much less serious consequence 
than otherwise. If there is no interaction, each observed treatment mean 
may be a biased estimate of the corresponding real population mean, but 
all observed treatment means will be biased alike. Accordingly, one can 
draw valid inferences about the relative effectiveness of the treatments in the 
real population, even though the total experimental sample is biased so far 
as the distribution of criterion measures is concerned. 


Test of the Significance of the Interaction 


The statistical test of the hypothesis of no interaction is based on the 
ratio of the mean square for treatments X levels to that for within-cells 
($ms_{AL}/ms_w$). The conditions under which this ratio is distributed as F are 
listed below. The last of these conditions constitutes the hypothesis to be 
tested. The first three are the assumptions basic to the test. 


1) Each treatment subgroup has been randomly selected from the cor- 
responding subpopulation in the parent population. 


2) The distribution of criterion measures for each of these subpopulations 
is normal. 


3) All of these distributions have the same variance ($\sigma^2$). 


4) The corresponding subgroup frequencies are in the same proportion 
from level to level. 


5) The difference between subpopulation means for corresponding treat- 
ments is the same for all levels. 


We have seen that in the table in which all differences among row and 
column means have been eliminated by arithmetic corrections (the twice- 
corrected table, page 111), the interaction sum of squares is identical with 
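the sum of squares for between-cells ($ss_{AL} = ss''_{cells}$). In this twice-corrected 
table, the ratio between the mean squares for cells and within-cells may 
be written 

$$
\frac{ms_{cells}}{ms_w}
= \frac{\sum_{i=1}^{l}\sum_{j=1}^{a} n_{ij}(M_{ij}-M)^2 \big/ (l-1)(a-1)}
       {\sum_{i=1}^{l}\sum_{j=1}^{a}\sum^{n_{ij}} (X-M_{ij})^2 \big/ (N-al)}
= \frac{\left[\sum_{i=1}^{l}\sum_{j=1}^{a} \dfrac{(M_{ij}-M)^2}{\sigma^2/n_{ij}}\right] \Big/ (l-1)(a-1)}
       {\left[\sum_{i=1}^{l}\sum_{j=1}^{a}\sum^{n_{ij}} (X-M_{ij})^2/\sigma^2\right] \Big/ (N-al)}
\tag{32}
$$
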


The third form of the ratio above is obtained by dividing both numerator 
and denominator of the middle ratio by $\sigma^2$. From the third form in which 
this ratio is written, it is apparent that under Conditions 1, 2, and 3 the 
ratio is that between two $\chi^2$'s, each divided by its degrees of freedom. (The 
student should be able to show why each of these three conditions is neces- 
sary.) 

The proof that the mean square in the numerator is independent of that 
in the denominator is similar to that given on page 135. For any given 
subgroup, $M_{ij}$ is independent of $\sum(X - M_{ij})^2$, on the assumption (Conditions 
1 and 2) that each subgroup is a random sample from a normal population. 
Hence, $(M_{ij} - M)^2$ must also be independent of $\sum(X - M_{ij})^2$. Accordingly 
$\sum_{i=1}^{l}\sum_{j=1}^{a} n_{ij}(M_{ij} - M)^2$ must be independent of $\sum_{i=1}^{l}\sum_{j=1}^{a}\sum^{n_{ij}}(X - M_{ij})^2$, and hence 
the numerator of (32) is independent of the denominator. 


We have thus shown that (32) is the ratio between two independent $\chi^2$'s, 
each divided by its degrees of freedom. Accordingly, we may write 

$$F = ms_{AL}/ms_w, \quad \text{for } (l-1)(a-1) \text{ and } (N-al)\ df,$$


and may use this ratio to test the stated hypothesis. 
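
The interaction test can be sketched computationally as well. In this illustration the interaction sum of squares is obtained by subtraction (between-cells minus treatments minus levels), which is legitimate only when the cell frequencies are proportional; the function name `interaction_f`, the `cells[i][j]` layout, and the data are hypothetical.

```python
def interaction_f(cells):
    """Return (F, (df1, df2)) for the treatments X levels interaction.

    cells[i][j] is the list of criterion scores for level i, treatment j.
    Assumes non-empty cells with proportional frequencies.
    """
    l, a = len(cells), len(cells[0])
    scores = [x for row in cells for cell in row for x in cell]
    N = len(scores)
    M = sum(scores) / N  # general mean

    def ss_between(groups):
        # Sum of n * (group mean - general mean)^2 over the groups
        return sum(len(g) * (sum(g) / len(g) - M) ** 2 for g in groups)

    columns = [[x for i in range(l) for x in cells[i][j]] for j in range(a)]
    rows = [[x for cell in row for x in cell] for row in cells]
    flat_cells = [cell for row in cells for cell in row]

    ss_A = ss_between(columns)        # treatments
    ss_L = ss_between(rows)           # levels
    ss_cells = ss_between(flat_cells)  # between-cells
    ss_AL = ss_cells - ss_A - ss_L    # interaction, by subtraction
    ss_w = sum((x - sum(c) / len(c)) ** 2 for c in flat_cells for x in c)

    ms_AL = ss_AL / ((l - 1) * (a - 1))
    ms_w = ss_w / (N - a * l)
    return ms_AL / ms_w, ((l - 1) * (a - 1), N - a * l)

# Hypothetical 2 levels x 2 treatments table, two subjects per cell
cells = [[[1, 3], [5, 7]],
         [[2, 4], [8, 10]]]
F_int, df_int = interaction_f(cells)
print(F_int, df_int)  # 1.0 (1, 4)
```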

It may be shown that if there is no interaction in the population, the expected 
value of $ms_{AL}$ is $\sigma^2$, or that $ms_{AL}$ is an unbiased estimate of $\sigma^2$. 

Interpretation of a Significant $F = ms_{AL}/ms_w$: Suppose that in an actual 
experiment the ratio of the mean squares for treatments X levels and within- 
cells is too large to be reasonably attributed to chance, and therefore, that 
the hypothesis of no interaction is rejected. 

We may note first that the test of this hypothesis is not based on any as- 
sumption that the total experimental sample is a representative sample (with 
reference to levels) from any specified population (real or hypothetical). 
This is why Conditions 1 and 4 on page 138 have been substituted for Condi- 
tion 1 on page 133. This hypothesis is concerned only with whether or not 
any interaction exists, and not with the effect of any possible interaction 
on the over-all treatment comparisons. All that is necessary, therefore, is 
that the numbers drawn from the various levels are in the same proportion for 
all treatments, and that the subjects are randomized with reference to treat- 
ments in each level. Conditions 1 and 4 (page 138) are almost wholly within 
the control of the experimenter and, if proper care is exercised in setting 
up the experiment, these conditions should be known to be satisfied so far as 
the hypothetical population is concerned. For reasons given earlier, this 
test of significance is not very sensitive to moderate departures from Condi- 
tions 2 and 3; all we need do is assure ourselves that there is no very marked 
departure from normality nor from homogeneity of the within-cells vari- 
ances. Hence, in general, a significant F means that the hypothesis of no 
interaction (Condition 5) is false, or that the differences between correspond- 
ing treatment means from level to level are too large to be due entirely to 
sampling fluctuations (Type S errors). 

It does not necessarily follow from this, however, that there is any intrinsic 
interaction between treatments and levels. To what extent a significant F 
may be taken as an indication of an intrinsic interaction depends in part 
on whether the treatments have been administered independently at each 
level or on a group basis for all levels simultaneously, and in part on the 
extent to which extraneous factors have been controlled or equalized through- 
out the entire experiment. If the treatments have been administered at each 
level separately, so that each level constitutes an independent replication 
of the whole experiment, then the observed variability in the differences between 
treatment means from level to level could be due primarily to the effect 
of Type G errors which vary from level to level. If a significant F is obtained 
in an experiment thus administered, it is quite possible that Type G errors 
alone account for its significance. The observed interaction, then, may be 
“significant,” and yet not “intrinsic.” In this design, as we have already 
noted, there is no possibility of differentiating the intrinsic from the extrinsic 
interaction. 

If the treatments are administered on a group basis to the subjects from 
all levels simultaneously, then many (but not necessarily all) of the extrane- 
ous factors associated with treatments will tend to have the same effect at 
all levels. Any extraneous factors which are thus additive in their effect 
will of course make no contribution to the interaction mean square, even 
though they do affect the treatments mean square (that is, even though they 
do affect the column means). It is quite possible, however, that some extrane- 
ous factors associated with treatments may themselves “interact” with levels. 
In that case, a part of the observed interaction will again be due to error, or 
will be extrinsic. Accordingly, if each treatment has been administered on a 
group basis to subjects from all levels simultaneously, a significant F is much 
more likely to indicate the presence of an intrinsic interaction than if the 
treatments have been administered independently (and Type G errors ran- 
domized) at each level separately. In neither case does the significant F 
necessarily mean that an intrinsic interaction is present. In either case, 
the more carefully controlled the experiment, the more surely does a sig- 
nificant F imply an intrinsic interaction. 

The fact that the obtained $F = ms_{AL}/ms_w$ is significant implies nothing 
whatever about the homogeneity or heterogeneity of the interaction. The 
significant F could be due entirely to the nonadditive effects of only one treat- 
ment; there may be no interaction at all among the remaining treatments. 
Indeed, a significant F could be due to what has happened to a single treatment 
subgroup, as, for example, to a large experimental error that is unique to 
that subgroup. However, it should be noted that the F-test of interaction for 
the table as a whole is not very sensitive to an interaction affecting only a 
small part of the entire table. It is quite possible, therefore, that an inter- 
action characteristic of only a small part of the table, and possibly of con- 
siderable consequence so far as that part of the table alone is concerned, 
may not be detected by the over-all test for interaction. 

If the observed interaction is significant and if an inspection of the data 
within the table suggests that this interaction is heterogeneous, one can 
apply tests of interaction independently to selected segments of the table, or 
to the treatments taken two at a time. If it appeared that only one of the 
treatments was responsible for the observed interaction, the data for this one 
treatment could be excluded from the table, and a test of interaction applied 
to the remaining table alone. Tests of interaction applied to only part of 
the table, however, are of dubious value if the interaction for the table as 
a whole had proved nonsignificant. To apply a test of interaction to selected 
segments of the table would in this case be to introduce a fallacy very similar 
to that discussed on pages 48-49 — that of applying to the largest of a number 
of observations the probability appropriate only for a single observation 
selected at random. If the interaction for the table as a whole is not sig- 
nificant, but the data suggest that there is an interaction for a part of the 
table alone, the safest procedure is to design an independent experiment only 
for the particular treatments and levels involved, and then apply an inde- 
pendent test of significance in this experiment. 


The Meaning of $ms_A/ms_{AL}$ 


If there is an intrinsic interaction between treatments and levels for a 
population, $ms_A/ms_{AL}$ will have an F-distribution with $(a - 1)/(a - 1)(l - 1)$ 
degrees of freedom only if the AL interaction effects are normally distributed, 
if the levels of L represented in the experiment may be regarded as a random 
sample from a population of such levels, and if the null hypothesis concern- 
ing treatment means is true. This interpretation is ruled out by the man- 
ner in which we have defined “levels” of the control variable in the treat- 
ments X levels design. However, if the number of subjects per treatment 
is large, the ratio $ms_A/ms_{AL}$ still has a useful meaning (although not as an 
F-ratio to test the hypothesis that the means of the specified treatment 
populations are identical). 

We have noted earlier that, so far as any two treatments are concerned, 
an intrinsic interaction may mean either of two things. It may mean that 
one treatment is superior to the other at all levels, but that its superiority is 
more marked at some levels than at others; or it may mean that one treatment 
is superior at some levels and that the other is superior at other levels. It 
would be worthwhile, obviously, to know which of these situations exists in 
any given instance. Neither of the tests of significance already con- 
sidered, however, contains any implications with reference to these two 
possibilities. 

To reveal the meaning of $ms_A/ms_{AL}$, let us consider a table in which $n_{ij} = n$ 
is constant, and in which the level (row) means have been equalized by constant 
arithmetic corrections within levels. These corrections, of course, 
will have no effect upon the column means, so that a given corrected column 
mean, such as $M'_{.2}$, will be the same as the corresponding original mean, 
in this case $M_{.2}$. In this once-corrected table, if there is a real interaction, 
there will probably still be significant differences in subgroup (cell) means 
within each treatment. These means, when plotted on a linear scale, might 
be distributed, for example, in either of the following ways, among many: 



[Figure 5. Possible distributions of treatment and corrected subgroup means. Two panels, a left-hand and a right-hand linear scale, each showing the corrected subgroup means $M'_{ij}$ and the treatment means $M'_{.j} = M_{.j}$.] 


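Let $s^2_{M_{.j}}$ represent the variance of the observed treatment means, and $s^2_{M'_{ij}}$ 
represent the average variance of the corrected subgroup means for the 
various treatments. That is, 

$$
s^2_{M_{.j}} = \frac{\sum_{j=1}^{a}(M_{.j}-M)^2}{a}
$$

and 

$$
s^2_{M'_{ij}} = \frac{1}{a}\sum_{j=1}^{a}\left[\frac{\sum_{i=1}^{l}(M'_{ij}-M_{.j})^2}{l}\right]
$$
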

The corresponding standard deviations are then $s_{M_{.j}}$ and $s_{M'_{ij}}$. Along the 
left-hand scale in Figure 5, we note that the corrected subgroup means for 
Treatment $A_2$ are closely concentrated around the $A_2$ mean ($M_{.2}$), that the 
three corrected subgroup means for $A_1$ are closely concentrated around $M_{.1}$, 
and similarly for $A_3$. Along this left-hand scale, the differences among the 
three corrected subgroup means for any treatment are small compared to 
the differences among the three treatment means. That is, $s_{M_{.j}}$ is considerably 
larger than $s_{M'_{ij}}$. 

In the right-hand figure, on the other hand, the three corrected subgroup 
means for Treatment $A_1$ are widely scattered around $M_{.1}$ in comparison with 
the spread of the treatment means. In other words, $s_{M_{.j}}$ is small in relation 
to $s_{M'_{ij}}$. 



We can see from this illustrative figure that if $s_{M_{.j}}$ is very much larger 
than $s_{M'_{ij}}$, then the treatments will have the same rank for each level, or 
approximately the same rank, even though the superiority of one treatment 
over another at one level may be very much greater than at another level. 
In the left-hand figure, for instance, $A_2$ ranks above $A_1$ and $A_3$ at each of 
the three levels, and $A_3$ ranks third at each of the three levels, even though the 
superiority of $A_2$ over $A_1$ for Level 1 is very much greater than the superiority 
of $A_2$ over $A_1$ for Level 3. In the right-hand figure, however, $A_2$ ranks highest 
at Level 1, but $A_3$ ranks highest at Level 2, and $A_1$ ranks highest at Level 3. 

The higher the ratio $s_{M_{.j}}/s_{M'_{ij}}$, the more surely will the rank order of the 
treatments be the same at each of the levels. If $s_{M_{.j}}$ is at least twice as large 
as $s_{M'_{ij}}$, then, in general, the rank order of the treatments is approximately 
the same at all levels, even though there may be minor variations from level 
to level. The ratio between these two standard deviations is somewhat 
analogous to the ratio between the standard deviation of obtained scores and 
the standard error of measurement on a standardized test. To require that 
the standard deviation of obtained scores be twice the standard error of 
measurement is to require that the reliability coefficient of the test be at 
least .75. We know from experience that for a test of this reliability, ex- 
aminees ranked in order of their obtained scores are also ranked approximately 
in the order of their true scores. 
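
The figure of .75 follows from the usual relation between the standard error of measurement and the reliability coefficient. The symbols $r_{xx}$, $\sigma_x$, and $\sigma_e$ (reliability, standard deviation of obtained scores, standard error of measurement) are introduced here only for this worked step and are not the text's own notation.

```latex
\sigma_e = \sigma_x\sqrt{1 - r_{xx}}
\quad\Longrightarrow\quad
r_{xx} = 1 - \frac{\sigma_e^2}{\sigma_x^2},
\qquad
\sigma_x = 2\sigma_e
\;\Longrightarrow\;
r_{xx} = 1 - \frac{1}{4} = .75
```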

In order that $s_{M_{.j}}$ be at least twice as large as $s_{M'_{ij}}$, it is necessary that 
$ms_A/ms_{AL}$ be at least $4(l - 1)$. The proof of this follows. 

In either situation, the standard deviation of the over-all treatment means 
(assuming $n_{ij} = n$ is constant) is given by 

$$
s_{M_{.j}} = \sqrt{\frac{\sum_{j=1}^{a}(M_{.j}-M)^2}{a}}
= \sqrt{\frac{(a-1)}{aln}\cdot\frac{ln\sum_{j=1}^{a}(M_{.j}-M)^2}{a-1}}
= \sqrt{\frac{(a-1)}{aln}\,ms_A}
$$

and the “average” standard deviation of the subgroup (cell) means within 
treatments is given by 

$$
s_{M'_{ij}} = \sqrt{\frac{\sum_{j=1}^{a}\sum_{i=1}^{l}(M'_{ij}-M_{.j})^2}{al}}
= \sqrt{\frac{(l-1)(a-1)}{aln}\cdot\frac{n\sum_{j=1}^{a}\sum_{i=1}^{l}(M'_{ij}-M_{.j})^2}{(l-1)(a-1)}}
= \sqrt{\frac{(a-1)(l-1)}{aln}\,ms_{AL}}
$$

This follows since $M'_{i.} = M$, from which $(M'_{ij} - M_{.j} - M'_{i.} + M) = (M'_{ij} - M_{.j})$. 
Hence, the ratio between these s's is given by 

$$
\frac{s_{M_{.j}}}{s_{M'_{ij}}}
= \sqrt{\frac{\dfrac{(a-1)}{aln}\,ms_A}{\dfrac{(a-1)(l-1)}{aln}\,ms_{AL}}}
= \sqrt{\frac{1}{l-1}\cdot\frac{ms_A}{ms_{AL}}}
$$

Accordingly, if $s_{M_{.j}}/s_{M'_{ij}}$ is to be greater than 2, $ms_A/ms_{AL}$ must be greater 
than $4(l - 1)$. For simplicity, we may as well require that $ms_A/ms_{AL} > 4l$. 

The preceding may be taken as a basis for the following “rule-of-thumb”: 
If the ratio between the mean squares for treatments and treatments X levels 
is several times as large as the number of levels, and if the number of cases in 
each treatment group is fairly large, one may quite safely conclude that the 
rank order of effectiveness of the treatments is approximately the same 
within each level, even though their relative effectiveness, precisely de- 
termined, may differ from level to level. 
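
The arithmetic behind the rule-of-thumb can be sketched as follows. The function name `rank_order_ratio` and the mean squares used are hypothetical; the sketch assumes a constant number of cases per cell and simply evaluates the ratio of the two standard deviations from the two mean squares and the number of levels.

```python
from math import sqrt

def rank_order_ratio(ms_A, ms_AL, l):
    """Ratio of the s.d. of treatment means to the average s.d. of corrected
    cell means, i.e. sqrt(ms_A / ((l - 1) * ms_AL)). Assumes equal cell sizes."""
    return sqrt(ms_A / ((l - 1) * ms_AL))

# Hypothetical mean squares from an experiment with l = 3 levels
strong = rank_order_ratio(ms_A=120.0, ms_AL=5.0, l=3)  # ms_A/ms_AL = 24 > 4(l - 1) = 8
weak = rank_order_ratio(ms_A=30.0, ms_AL=5.0, l=3)     # ms_A/ms_AL = 6 < 8

print(round(strong, 2), strong >= 2)
print(round(weak, 2), weak >= 2)
```

A ratio of 2 or more corresponds to the requirement that the ratio of mean squares be at least 4(l - 1); it is a rough guide only and carries no exact probability statement.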

This is a very rough rule-of-thumb and is offered with considerable hesi- 
tation, since rules-of-thumb in general tend to be rather uncritically applied. 
If the total number of cases is quite large and Type G errors are negligible, 
the rule should be quite useful, but it is not possible to suggest exact critical 
values concerning the size of the sample or the magnitude of the ratio beyond 
which it is safe to generalize about treatment effects to all levels. 

The preceding interpretation of $ms_A/ms_{AL}$ assumes an intrinsic interaction. 
On the assumption that there is no interaction (either intrinsic or extrinsic), 
the ratio has another meaning. In that case, the interaction mean square is an 
unbiased estimate of the common within-cells variance (see page 139), and 
may be used in place of the within-cells mean square as the error term for 
testing the significance of the treatment differences. That is, on the assump- 
tion that there is no interaction, we may write 


$$F = ms_A/ms_{AL}, \quad \text{for } (a-1) \text{ and } (a-1)(l-1)\ df,$$
or 
$$F = ms_A \Big/ \frac{ss_{AL} + ss_w}{N - a - l + 1}, \quad \text{for } (a-1) \text{ and } (N - a - l + 1)\ df,$$

and use either of these ratios to test the treatments effect.¹ However, this 
assumption is nearly always a dangerous one, particularly due to the ever- 
present possibility of extrinsic interaction; hence, even though the assump- 
tion seemed reasonable, we would use this ratio to test the treatment effects 
only if the more valid test earlier considered could not be applied. 
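
The second of these ratios, in which the interaction and within-cells sums of squares are pooled, may be sketched as below. The function name `pooled_error_f` and the sums of squares supplied are hypothetical, and the sketch is meaningful only under the stated assumption of no interaction.

```python
def pooled_error_f(ss_A, ss_AL, ss_w, a, l, N):
    """F for treatments with interaction and within-cells pooled as error.

    a = number of treatments, l = number of levels, N = total number of cases.
    Valid only on the assumption that there is no interaction.
    """
    df_error = N - a - l + 1  # (a - 1)(l - 1) + (N - al)
    ms_A = ss_A / (a - 1)
    ms_pooled = (ss_AL + ss_w) / df_error
    return ms_A / ms_pooled, (a - 1, df_error)

# Hypothetical sums of squares from a 2 x 2 table with N = 8
F_pooled, df_pooled = pooled_error_f(ss_A=50.0, ss_AL=2.0, ss_w=8.0, a=2, l=2, N=8)
print(F_pooled, df_pooled)  # 25.0 (1, 5)
```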


1 The proof that the second of these ratios is distributed as F is as follows: Suppose 
that in the double-entry table arithmetic corrections are applied (see page 110) so as 
to make all level (row) means equal to the general mean. In this once-corrected table, 
since there is no interaction, the differences among the subgroup (cell) means within 
each treatment (column) are due only to random sampling fluctuations. In other 
words, in this once-corrected table, the measures in each treatment group (column) 
may be collectively regarded as a simple random sample; the design therefore is a 
simple-randomized design, in which the ratio of the mean square for treatments to 
the mean square for within-treatments is known to be distributed as F. However, in 
the once-corrected table $ss'_{wA} = ss_{AL} + ss_w$ and $ss'_A = ss_A$, and the degrees of freedom for 
$ss'_{wA}$ = $(a - 1)(l - 1) + (N - al) = (N - a - l + 1)$. Thus, $ms'_{wA} = \dfrac{ss_{AL} + ss_w}{N - a - l + 1}$; 
hence, $ms_A/ms'_{wA} = ms_A \Big/ \dfrac{ss_{AL} + ss_w}{N - a - l + 1}$ is distributed as F. 


The situation is quite different if there is an extrinsic interaction but 
no intrinsic interaction, and if the extrinsic interaction is due only to ex- 
perimental errors which have been randomized independently for each level 
of the experiment. If this may be assumed to be the case, the interaction 
mean square becomes the appropriate error term, and one which takes both 
Type S and Type G errors into consideration. The logic of the test of sig- 
nificance to be applied in this case will be discussed later (Chapter 8, pages 
201-202). 


Treatments X Levels Designs with One Observation per Cell 


The practice has sometimes been followed in treatments X levels experiments 
of arranging the subjects in order of the control variable, and then 
counting off just a (the number of treatments) subjects at a time, letting each 
group of a subjects correspond to a separate level. In this case there is only 
one observation in each cell of the table; there is consequently no possibility 
of computing a within-cells mean square. The total sum of squares can 
nevertheless be analyzed into its treatments, levels, and treatments X levels 
components. In this case, on the assumption that there is no interaction (either 
intrinsic or extrinsic), the significance of the treatment differences may be 
tested by means of 


$$F = ms_A/ms_{AL}, \quad df = (a-1)/(a-1)(l-1). \tag{33}$$


In this case, $ms_{AL}$ is presumably due to random Type S errors only, and $ms_{AL}$ 
would be interchangeable with $ms_w$ if the latter were available. 
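
With one observation per cell, the analysis reduces to treatments, levels, and a remainder that serves as the interaction term. The following sketch uses a hypothetical function name `one_per_cell_f` and invented scores; `table[i][j]` is the single score for level i under treatment j, and the F obtained is interpretable only under the no-interaction assumption just stated.

```python
def one_per_cell_f(table):
    """F = ms_A / ms_AL for a treatments X levels table with one score per cell."""
    l, a = len(table), len(table[0])
    scores = [x for row in table for x in row]
    N = len(scores)  # equals a * l
    M = sum(scores) / N

    ss_total = sum((x - M) ** 2 for x in scores)
    # Treatments (columns): l scores per column
    ss_A = sum(l * (sum(table[i][j] for i in range(l)) / l - M) ** 2 for j in range(a))
    # Levels (rows): a scores per row
    ss_L = sum(a * (sum(row) / a - M) ** 2 for row in table)
    ss_AL = ss_total - ss_A - ss_L  # remainder: treatments X levels

    ms_A = ss_A / (a - 1)
    ms_AL = ss_AL / ((a - 1) * (l - 1))
    return ms_A / ms_AL, (a - 1, (a - 1) * (l - 1))

# Hypothetical table: 3 levels (rows) x 2 treatments (columns)
table = [[3, 5],
         [4, 8],
         [8, 8]]
F_one, df_one = one_per_cell_f(table)
print(F_one, df_one)  # 3.0 (1, 2)
```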

This test of significance is also valid on the assumption that there is no 
intrinsic interaction, and that the extrinsic interaction is due only to experi- 
mental errors that have been randomized independently at each level. (See 
pages 201-202.) 

If the treatments have been administered on a group basis to all levels 
simultaneously, and if there is an interaction (either intrinsic or extrinsic, 
or both), the observed interaction mean square will be larger than the mean 
square for within-cells had there been two observations per cell. That is, 
the “error” term will then contain an effect (interaction) which has been controlled 
in the experiment and therefore does not belong in the error term. 
It may be argued that this will only tend to make the ratio smaller than it 
would otherwise be, and that, therefore, this F-test, while strictly invalid, 
will err on the conservative side. In other words, it is argued that if the 
treatments effect is “significant” by this test it would prove even more sig- 
nificant if a within-cells error term could be employed. It is true that when 
an inflated error term is used, the risk of a Type I error is less than the level 
of significance of the test indicates, but it is also true that the risk of a Type 
II error is considerably greater than when a strictly valid error term is em-
ployed. In general, there is no good excuse for using an invalid error term
when a valid one is readily available.


The use of ms_{AL} as an error term when n = 1 has been wrongly en-
couraged in some texts, and the student should be on guard against this
practice. 


Tests of Significance Applied to Individual Differences 


If F = ms_A/ms_w proves to be significant at the selected level of significance,
the experimenter will usually wish to identify the differences for individual
pairs of treatments that are significant at the same level of significance.
If F = ms_{AL}/ms_w proves to be significant, he may also wish to identify the
significant differences among the treatments at each level of the control
variable separately. This is done in exactly the same manner as with a
simple-randomized design (see pages 90-96), the estimated standard errors
of all means being again based on ms_w. Specifically,


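est'd \sigma_{M_{.j}} = \sqrt{ms_w / n_{.j}};   est'd \sigma_{M_{i.}} = \sqrt{ms_w / n_{i.}};   est'd \sigma_{M_{ij}} = \sqrt{ms_w / n_{ij}};


est'd \sigma_{M_{i3} - M_{i4}} = \sqrt{ms_w (1/n_{i3} + 1/n_{i4})}, etc.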


The number of degrees of freedom for any of these tests is that for the error
mean square (ms_w), namely (N - al).
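
As a minimal sketch of how these standard errors are used, the Python functions below compute the estimated standard error of a mean and of a difference between two means from ms_w, and the corresponding t. The function names and the numerical values are hypothetical; the t obtained would be referred to the t table with the (N - al) degrees of freedom of ms_w.

```python
from math import sqrt


def se_mean(ms_w, n):
    """Estimated standard error of a mean based on n observations."""
    return sqrt(ms_w / n)


def se_difference(ms_w, n_1, n_2):
    """Estimated standard error of the difference between two means."""
    return sqrt(ms_w * (1 / n_1 + 1 / n_2))


def t_for_difference(mean_1, mean_2, ms_w, n_1, n_2):
    """t for the difference between two means, with ms_w as the error term."""
    return (mean_1 - mean_2) / se_difference(ms_w, n_1, n_2)


# Hypothetical values: ms_w = 16.0, eight subjects in each of two cells.
t = t_for_difference(54.0, 49.0, ms_w=16.0, n_1=8, n_2=8)
print(f"t = {t:.2f}")
```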

When ms_w computed from the entire table is used as the error term in
testing the significance of the difference between two particular means, it 
is of course assumed that the within-subgroups variance is homogeneous for 
the entire table. If this assumption is questionable for the whole table, 
but one can safely assume that the within-subgroups variance is homogeneous 
for the two groups or subgroups being compared, then it would be better in 
these F and t-tests to use an error term computed only for the groups or 
subgroups involved in the comparison. 


Possibilities of Confounding Extraneous Factors with Levels 


In some experimental situations, certain Type G errors may be unavoid- 
able, but some of these unavoidable errors may, nevertheless, be assignable to 
treatment subgroups by the experimenter. For example, the task of "run- 
ning" the subjects in an animal experiment may have to be divided among a 
number of laboratory assistants, with the possibility that systematic differ- 
ences in extraneous factors may be associated with these administrators of the 
experiment. Again, it may be necessary to run the animals on different days, 
and systematic day-to-day differences in extraneous factors may increase 
the variability of the results. In such instances the experimenter may some- 
times be able to “confound” the extraneous factor with the levels factor. 
For example, he might assign the treatment subgroups at one level (or com-
bination of levels) to one administrator, those at another level to another
administrator, etc., or he might run the animals at one level on one day, those
at another level on another day, etc. The extraneous factor would then tend
to create differences among levels, but not among treatments within the 
same level. The effect of the extraneous factor would then be “taken out” 
of the error mean square and, to some extent, out of the treatments and inter- 
action mean squares as well. 

Whether or not it is desirable to confound extraneous factors with levels 
in this manner depends upon the purposes of the experiment, and also upon 
whether or not the extraneous factor interacts with the experimental factor 
(treatments). If the treatments X levels design is used only for the purpose 
of increasing the precision of the experiment, and the experimenter has no 
interest in the treatments X levels interaction for its own sake, then con- 
founding unavoidable variations in extraneous factors with levels is highly 
desirable, and the possibility should be exploited to the maximum. Some-
times, however, one of the purposes of the experiment is to study the inter-
action of treatments X levels, and the ratio ms_{AL}/ms_w is used to test the
hypothesis that there is no intrinsic interaction. In this case, the interaction 
term should, of course, be kept as free as possible of extraneous interaction 
effects. Accordingly the experimenter's objective must be to eliminate Type G 
errors entirely. Certainly in this case one would not wish to confound an ex- 
traneous factor with levels unless he was very confident that the extraneous 
factor does not interact with treatments. 


Limitations and Advantages of the Treatments X Levels Design 


The basic limitations and advantages of the treatments X levels design
have already been considered in Chapter 1 (pages 13-16), but it may 
be well to review them here in relation to the generalized form of the 
design. 

The treatments X levels design is primarily intended to serve the same
purposes as the simple-randomized design, that is, to determine whether or 
not a number of treatments are, on the average, equally effective for the mem- 
bers of the specified population. The major advantage of the treatments X 
levels design over the simple-randomized design is that for the same number 
of subjects the treatments X levels design is more precise and thus usually 
more efficient. How much more precise the treatments X levels design is 
depends on the correlation between the control and the criterion variable. 
How much more efficient the treatments X levels design is depends on how 
the cost of organizing the subjects into levels (including the cost of securing 
the control measures by which the levels are constituted) compares with the 
cost of securing the same increase in precision with the simple-randomized 
design simply by adding more cases. 

A second advantage of the treatments X levels design is that it may yield 
valuable information concerning a possible interaction between treatments
and levels that could not be derived from a simple-randomized design. In
most treatments X levels experiments, this information is of theoretical inter- 
est only, since whatever treatment is to be employed in practice will ordinarily 
be used with heterogeneous groups containing individuals from all levels. 
However, a demonstration of a marked and intrinsic interaction may some- 
times lead to the use in practice of different treatments at different levels. 

Another important advantage of the treatments X levels design has already 
been pointed out on pages 136-137, where it was shown that the assumptions 
underlying the test of the significance of the treatments effect are in general 
much more likely to be valid with the treatments X levels than with the 
simple-randomized design. 

The principal limitation of the treatments X levels design is the same as 
that of the simple-randomized design. Neither design takes cognizance of 
Type G or Type R errors (with the single exception explained later on pages
201-202).

It is worth noting that the treatments X levels design is particularly 
satisfactory for the purpose of testing for the presence of an intrinsic treat- 
ments X levels interaction when each treatment may be simultaneously 
administered on a group basis to subjects at all levels of the control variable. 
The illustration given on page 14 is of this character. In this case the possi- 
bility of systematic differences from one treatment subgroup to another 
within the same treatment due to extraneous factors is minimized, and hence 
the likelihood of an extraneous interaction is minimized also. 


What to Do About Missing Cases 


Sometimes, in a treatments X levels experiment, one or more cases may be 
“lost” during the course of the experiment; a subject may be unable to con-
tinue to participate in the experiment, a rat may die, the data for a particular
subject may be rendered unusable by an accidental failure to administer 
the treatment to him properly, etc. In this case, the requirement of pro- 
portionality of cell frequencies from row to row or column to column of the 
double-entry table will not be satisfied with the incomplete data, and one 
may not proceed with the analysis of the total sum of squares until the 
missing data have been "replaced." The simplest and most practicable pro- 
cedure in general is to replace each missing datum by a value equal to the 
mean of the remaining observations in the same treatment subgroup (cell) 
and then to proceed in the usual manner, having first subtracted one degree 
of freedom for each of the missing observations from the degrees of freedom 
for total, and thus also from the degrees of freedom for error. When this is 
done, the test of significance will no longer be exact, but if only one or two 
cases are missing and the number of degrees of freedom for error is reason- 
ably large, the test will nevertheless be sufficiently exact for most practical 
purposes. 
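
The replacement procedure just described can be sketched as follows. This is an illustration under stated assumptions, not a prescribed routine: the function name is hypothetical, lost cases are marked with `None`, and the small 2 X 2 table of scores is invented for the example.

```python
def replace_missing(cells):
    """Replace each lost observation by the mean of its own cell.

    cells[i][j] is the list of observations at level i under treatment j,
    with None marking a lost case. Returns the filled table together with
    the degrees of freedom for total and for error, each reduced by one
    for every replaced observation.
    """
    filled = []
    n_missing = 0
    for row in cells:
        filled_row = []
        for cell in row:
            observed = [x for x in cell if x is not None]
            cell_mean = sum(observed) / len(observed)
            n_missing += len(cell) - len(observed)
            filled_row.append([cell_mean if x is None else x for x in cell])
        filled.append(filled_row)

    n_total = sum(len(cell) for row in filled for cell in row)
    n_cells = sum(len(row) for row in filled)
    df_total = (n_total - 1) - n_missing
    df_error = (n_total - n_cells) - n_missing
    return filled, df_total, df_error


# Hypothetical data: 2 levels by 2 treatments, 3 subjects per cell, 2 lost.
data = [
    [[4, 6, None], [7, 8, 9]],
    [[1, 2, 3], [5, None, 7]],
]
filled, df_total, df_error = replace_missing(data)
print(filled[0][0], filled[1][1], df_total, df_error)
```

Because a replaced value equals its cell mean, it adds nothing to the within-cells sum of squares, which is why the degrees of freedom for error must be reduced accordingly.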


The Use of Transformations 


The use of transformations with the treatments X levels design presents 
certain problems which are briefly considered here for the student with ad- 
vanced interests in the subject of transformations, but which are hardly 
appropriate for inclusion in a first course in experimental design. The fol- 
lowing discussion, therefore, may be skipped by most students using this 
text. (See footnote on page 89.) 

The null hypothesis concerning the treatment population means of the
original measures (H_1: \mu_{.1} = \mu_{.2} = ... = \mu_{.a}) is equivalent to the null hypothe-
sis concerning the treatment population means for the transformed data
(H_2: \mu'_{.1} = \mu'_{.2} = ... = \mu'_{.a}) only if the transformed measures show exactly
the same distribution for each treatment population. (In the subsequent dis-
cussion, all primed terms are based on the transformed measures, the un-
primed terms on the original measures.) The form of the distribution of
criterion measures for a treatment population depends both on the form of
the distribution within subpopulations (within-cells) and on the distribution
of subpopulation (cell) means. Accordingly, if the subpopulation distributions
are all normal and of the same variance, then H_1 is equivalent to H_2 only
if the distribution of subpopulation means is the same for all treatments.
This could be true, of course, if there were no AL’ interaction. Theoreti-
cally, it could also be true even if there is an AL’ interaction, as is demon-
strated in the following figure. This figure presents the graphs of treatment
subpopulation means for the various levels in an experiment involving three
treatments and three levels.

[Figure: treatment subpopulation means for three treatments (A_1, A_2, A_3) at three levels (L_1, L_2, L_3).]

A situation such as that diagrammed, however, is so inconsistent with the
usual behavior of physical and psychological laws that it need hardly be
considered as a practical possibility. For practical purposes, therefore, we
may say that the distribution of subpopulation means will be the same for
all treatment populations only when there is no AL’ interaction. The ratio
F = ms'_A/ms'_w provides a valid test of H_2 only if the subpopulation (cell)
distributions are normal. Accordingly, this ratio provides a valid test of
both H_1 and H_2 only if the conditions listed on pages 133-134 are satisfied
with the transformed data, and if, in addition, there is no AL’ interaction.
If the other conditions are satisfied, however, the ratio will provide a valid
test of H_2 whether or not there is an AL’ interaction. In specific instances,
it is quite possible that a transformation may be found which normalizes the
subpopulation distributions and makes them equally variable, and also
renders the AL’ interaction equal to zero. In that case, of course, both H_1
and H_2 can be readily tested with the transformed data by means of
F = ms'_A/ms'_w.

If there is an AL’ interaction, then to test H_1 with the transformed data,
we must transpose it into the equivalent hypothesis on the transformed scale.
The hypothesis equivalent to H_1 will then take the form H_3: \mu'_{.1} = k_2 \mu'_{.2} = ... =
k_a \mu'_{.a}, with k possibly differing for each value of j. For example, suppose
that in an experiment involving just two treatments and two levels, with two
observations in each subgroup, the transformation is Y = log X and the
transformed measures are as given below.


      Transformed Measures                  Original Measures
              A_1     A_2                        A_1        A_2
      L_1      4       5              L_1     10,000    100,000
               3       4                       1,000     10,000
      L_2      3       2              L_2      1,000        100
               2       1                         100         10
    Means     3.0     3.0           Means      3,025   27,527.5


Suppose that the means and variance estimates for the various subgroups 
happened to coincide with the corresponding subpopulation values. We 
note that the variances of the subpopulations are all the same. We see also
that there is no main effect for the transformed measures (the mean for
both treatment populations is 3.0) but that there is an interaction. Never-
theless, the treatment population means of the original measures are 3,025
and 27,527.5.
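
The figures in this example are easily checked. The Python sketch below rebuilds the original measures from the transformed ones, assuming (as the tabled values imply) that the logarithm is to base 10. The dictionary layout and function names are hypothetical conveniences, not part of the example itself.

```python
# cells[(level, treatment)] holds the transformed measures Y = log10(X).
transformed = {
    ("L1", "A1"): [4, 3], ("L1", "A2"): [5, 4],
    ("L2", "A1"): [3, 2], ("L2", "A2"): [2, 1],
}
# Undo the transformation to recover the original measures X = 10 ** Y.
original = {cell: [10 ** y for y in ys] for cell, ys in transformed.items()}


def treatment_mean(cells, treatment):
    """Mean of all measures under one treatment, pooled over levels."""
    values = [x for (_, t), xs in cells.items() if t == treatment for x in xs]
    return sum(values) / len(values)


def cell_mean(cells, level, treatment):
    xs = cells[(level, treatment)]
    return sum(xs) / len(xs)


def interaction_contrast(cells):
    """(A1 - A2 at L1) minus (A1 - A2 at L2); zero when there is no interaction."""
    return ((cell_mean(cells, "L1", "A1") - cell_mean(cells, "L1", "A2"))
            - (cell_mean(cells, "L2", "A1") - cell_mean(cells, "L2", "A2")))


for name, cells in (("transformed", transformed), ("original", original)):
    print(name, treatment_mean(cells, "A1"), treatment_mean(cells, "A2"),
          interaction_contrast(cells))
```

The treatment means coincide on the transformed scale while the interaction contrast is not zero, and the same data yield widely different treatment means on the original scale.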

If the treatments represent differing amounts of a common experimental
variable, H_1 may sometimes be expressed as H_3: \mu'_{.j} = F(z_j), z being the ex-
perimental variable. If one can express H_3 in either of the forms just sug-
gested, one can then test this hypothesis with the transformed data in the
manner to be explained later in Chapter 15.

The hypothesis that there is no AL interaction can never be equivalent
to the hypothesis that there is no AL’ interaction. Furthermore, the inter-
action cannot be homogeneous on both scales. If we let c_{ij} represent the true
interaction effect in cell ij, then the hypothesis of no AL interaction may be
expressed H_4: c_{ij} = 0 (for all values of i and j). The equivalent hypothesis
on the transformed scale would then take the form H_5: k_{ij} c'_{ij} = b (for all values
of i and j), b representing a constant of unspecified value, and k possibly differ-
ing for each combination of i and j. That is, if there is no AL interaction,
there will be an AL’ interaction and the AL’ interaction will be heterogeneous.
The writer has not investigated the difficulties involved in expressing H_4 in
a form in which it could be tested with the transformed data, but presumably
these difficulties would be considerable.

The hypothesis that there is no AL’ interaction (H_6) can, of course, be
tested with the transformed data, granting that the conditions listed on page
138 are met. In any psychological experiment, the hypothesis (H_4) that there
is no AL interaction represents a different psychological law than the hypothe-
sis (H_6) that there is no AL’ interaction. Sometimes, once the transformation
has been selected, the experimenter may conclude that the law in which he is
really interested after all is that represented by H_6 rather than by H_4. In
that case, of course, the difficulties considered at the close of the preceding
paragraph will be obviated.

An exact hypothesis that can be expressed in terms of the original scale 
can always be expressed in equivalent form in terms of the transformed scale. 
This represents a specific instance of a more fundamental truth of consider- 
able significance. It is that any psychological or natural law is independent of 
the scales on which it is expressed; the law exists apart from the scale, but 
may take a different form for different scales. One scale may lead to a much 
simpler form of the law than others, and is to be preferred for this reason. 
The scale which leads to the simplest expression of the law, however, is not 
necessarily that which permits the simplest statistical test of the law (or 
hypothesis) in an experimental situation. Quite obviously these observations 
have very significant implications for the whole problem of scaling in psycho- 
logical and educational measurement. 


STUDY EXERCISES ¹


1. A simple methods experiment, designed to compare two methods of 
teaching a week-long unit on community water supplies, was carried out in 
Grade VI of a particular school. Before the experiment began the thirty 
pupils were ranked on the basis of IQ scores and divided into three levels of 
ability with ten pupils in each level. Within each level, the pupils were ran- 
domly assigned, five to each of the two method groups, A and B. All of the 
pupils spent the first two days of the experimental week in general class dis- 
cussion of the topic. The pupils in Group A spent the remaining three days in 
individual study on the unit in the school library. The pupils in Group B 
spent the last three days in the classroom where they prepared for a film which 
was shown on Thursday — and spent Friday in reviewing the film and the 
unit in general. A standardized test over the unit was administered to the 
entire class on the following Monday and the scores on this test served as 
criterion measures. The results were as follows: 


¹ See second paragraph on page viii.


                     Criterion Test Score

                     Method A    Method B
                        82          95
                        71          89
    Superior (1)        73          92
                        63          77
                        60          69

                        58          72
                        76          80
    Average (2)         69          65
                        65          84
                        62          66

                        46          53
                        57          63
    Inferior (3)        42          61
                        54          57
                        56          58


Partial Computational Results 


                      A1       B1       A2       B2       A3       B3
T_{ij} = \sum X      349      422      330      367
\sum X^2           24663    36100    21970    27221
T_{ij}^2/5         24360    35617    21780    26938

a) Complete the analysis of the data and prepare a summary table. 


b) Define the two treatment populations in terms specific to this experi- 
ment. 


c) State the hypothesis tested by the ratio ms_A/ms_w. Can this hypothesis
be rejected at the 5% level? 


d) Why is the variance of the criterion measures for the subpopulation cor- 
responding to the middle level probably smaller than for either extreme 
level? In view of the results of the Norton study, is the heterogeneity of 
variance likely to affect seriously the validity of the usual F-tests of 
interaction and of treatment effects? Explain. 


e) Granting the assumptions underlying the test in (c) to be adequately
satisfied, suggest some possible explanations for the observed result other
than a real difference between the methods.


f) Apply a test of significance of the treatment differences in the manner 
that would have been appropriate had the two treatment groups been 
simple random samples rather than matched samples. Why is ms_A/ms_w
smaller in this case? If the groups had been simple random samples,
would you expect the absolute differences in means to be larger or smaller 
than in this experiment? Why? 


g) Compute the ratio ms_{AL}/ms_w. How do the conditions which make this
ratio an F differ from those related to the ratio ms_A/ms_w? May the
hypothesis of no interaction be rejected at the 5% level?


h) How does the outcome of the test of no interaction affect the interpreta- 
tion of the test of no methods difference? 


i) What feature of this experiment makes an extrinsic interaction unlikely? 
What factors, operating in what manner, could conceivably give rise to 
an extrinsic interaction? 


2. A rat-feeding experiment involved three diets and five replications at 
different levels of initial weight of the rats used. The 30 heaviest rats in a 
stock of 150 were randomly assigned, 10 to Diet 1, 10 to Diet 2, and 10 to Diet 
3, and a simple controlled experiment was conducted at this level. Similar 
experiments were independently performed at each of four other weight levels 
(the second 30 heaviest rats, etc. — down to the 30 lightest rats). In each 
experiment, the rats under each diet were kept in a separate cage, and all 
possible factors affecting cages systematically were randomized with reference 
to cages. For example, the cages were randomly assigned to their permanent 
positions in the rat room. At the close of the experiment the total gains for 
individual cages were as follows: 


            Total Gain in Weight per Cage (n = 10 per cell)

                  Diet 1   Diet 2   Diet 3    Total    T_L^2/30
    Level 1          5       21       40        66       145.2
    Level 2         22       11       46        79       208.0
    Level 3         37       30       41       108       388.8
    Level 4         18       31       54       103       353.6
    Level 5         43       47       77       167       929.6
    Total (T_D)    125      140      258       523
    T_D^2/50     312.5    392.0   1331.3       T^2/150 = 1823.5


Data not available from above table:
    \sum\sum\sum X^2 = 2563      Sum of squared gains for individual rats
    \sum\sum T_{DL}^2 = 22925    Sum of squared cage totals

a) Prepare a complete summary table of the analysis, giving the various
degrees of freedom, sums of squares, and mean squares.


b) Compute ms_D/ms_w. Under what conditions (specific to this experiment)
is this ratio distributed as F?


c) Is this F significant at the 5% level? Define precisely the parent popula-
tion concerning which inferences may be drawn on the basis of this F-
test.


d) Suggest several specific sources of error which are taken into account in
this test of significance. Suggest some which are not. For example, are cage
differences taken into account?


e) Does the F-test of (b) involve the assumption of no interaction between
diets and levels? Explain. Can the results of the test be more satisfac-
torily interpreted if one may assume no interaction? Explain.


f) What evidence from the data in the summary table shows that the use of
the treatments X levels design improved the precision of this experiment
(as opposed to the use of the simple-randomized design)?


g) Compute ms_{DL}/ms_w. Under what conditions (specific to this experi-
ment) is this ratio distributed as F? Exactly what hypothesis is tested
by this ratio?


h) Which type of interaction probably predominates in this experimental
situation, extrinsic or intrinsic? Why? (Suppose another experiment
were conducted in exactly the same manner in all respects except that the
same diet was given all rats. This would constitute a uniformity trial.
Why might one find a significant "interaction" in this uniformity trial?)


i) Do the conditions of this experiment suggest that one may infer from a
significant F = ms_{DL}/ms_w that an intrinsic interaction exists?


j) Do the diet means differ significantly (at the 5% level) at level 3?


k) Suppose the possibility of an intrinsic DL interaction may not be ignored.
What useful conclusions may then be drawn from the magnitude of
F = ms_D/ms_{DL}?


l) What is gained by randomizing cage conditions for the entire experi-
ment? Are the treatment means better estimates of the treatment
effects than if this had not been done? (Than, for example, if the cages
had been assigned to locations in the rat room on a haphazard or casual
basis, rather than strictly according to chance?) Is the apparent pre-
cision of the experiment (as indicated by the within-cells mean square)
dependent on how the cages are assigned to external conditions?


m) Rather than randomize cage conditions for the entire experiment, might
it have been better to have made cage conditions as homogeneous as
possible within each level, and thus have maximized cage differences
from level to level, or “confounded” cage conditions with levels? (For
example, one might have assigned the cages in one level to the three
“most favorable” locations in the rat room, those in another level to the
three “next best” locations, and finally those in the remaining level to
the three “least favorable” conditions, and then have assigned cages
to treatments at random within each level.) Would the experiment then
have been better controlled or more precise? Would the apparent pre-
cision have been improved? Explain.


n) It would obviously be inconvenient to feed different diets to rats in the
same cage. Suppose, however, that this could have been done in this
experiment. Might it then have been desirable to have assigned rats at
random to cages as well as to treatments within each level, rather than in
the manner earlier described? Would cage differences then be taken into
consideration in the error term and in the tests of significance?


o) Suppose that the rats at each level were randomly assigned to six cages
rather than to three, and that the cages were randomly assigned both to
treatments and to cage conditions within each level. There would thus
be two cage totals for each cell of the two-way table. Suppose that the
analysis were then based on the cage totals, rather than on the gains for
individual rats. The total mean square would then have 29 degrees of
freedom, and that for within-cells would have 15 degrees of freedom, but
the degrees of freedom for the other mean squares would be the same as
before. One could then test the treatment effects by F = ms*_D/ms*_w
(the asterisks indicating that the mean squares are based on cage totals,
and are not to be confused with those based on individual gains). In
what respects would this be a better test than that based on F = ms_D/ms_w
in the original experiment? Would cage differences within levels then be
taken into consideration in the test of significance? Would any advan-
tage of the original experiment be sacrificed?


The Treatments X Subjects Design 


The Generalized Case of the Treatments X Subjects Design 


The treatments X subjects design is that in which the treatments are all 
administered in succession to the same subjects, instead of to different groups 
of subjects as in the simple-randomized and treatments X levels designs. 
The results for a treatments X subjects experiment can always be recorded 
in a double-entry table, in which the columns correspond to treatments and 
the rows to individual subjects. 

The reason for using the treatments X subjects design is usually simply 
to increase the precision of the experiment by eliminating inter-subject differ- 
ences as a source of error. In that case, only one criterion measure is usually 
recorded for each subject for each treatment, although this criterion meas- 
ure may be the mean of a number of independent observations. Another 
possible reason for using this design is to permit a study of the interaction 
of treatments and subjects, that is, to determine if the relative effectiveness 
of the treatments differs from subject to subject. In this case, at least two 
independent criterion measures must be obtained for each subject under 
each treatment, in order to make available a within-cells mean square to 
test the significance of the interaction. Furthermore, the numbers of obser- 
vations for the various treatments must be in the same proportion from 
subject to subject. 


Analysis of the Total Sum of Squares 


When the reason for using the design is simply to increase the precision 
of the experiment, and when only one criterion measure is recorded in each 
cell of the double-entry table, the analysis of the total sum of squares is as 
indicated in the summary table below. The analysis is exactly like that de- 
scribed on page 114 except that A, S, a, and s have been substituted for C, R, 
c, and r respectively. The total number of criterion measures (N) in this 
case is equal to as. T; represents the sum of the criterion measures for the 

156 


TESTING THE SIGNIFICANCE OF TREATMENTS EFFECTS 157 


Jth treatment, and T;, that for the ith subject, and T that for the entire table, 
while X represents a criterion measure. 


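Source                        df                 ss                                         ms

Treatments (A)                a - 1              ss_A = \sum_j T_{.j}^2 / s - T^2/as        ss_A/(a - 1)

Subjects (S)                  s - 1              ss_S = \sum_i T_{i.}^2 / a - T^2/as        ss_S/(s - 1)

Treatments X Subjects (AS)    (a - 1)(s - 1)     ss_{AS} = ss_T - ss_A - ss_S               ss_{AS}/(a - 1)(s - 1)

Total                         as - 1             ss_T = \sum_i \sum_j X^2 - T^2/as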


When one of the purposes of the experiment is to test the interaction, in 
which case there are at least two observations per cell, the analysis is exactly 
like that described on pages 114-115. 

Instances in which the treatments X subjects design is used for the second 
of these two purposes are relatively rare. In general, the experimenter is 
willing to take for granted the presence of a treatments X subjects interac- 
tion and employs the design only in order to increase the precision and effi- 
ciency of the experiment. The subsequent discussion will, therefore, unless 
otherwise specified, be concerned only with the case in which the analysis is 
like that indicated in the preceding table. 


Testing the Significance of the Treatments Effects 


In the treatments X subjects design, the test of the significance of the 
treatments effects is based on the ratio of the mean squares of treatments and 
treatments X subjects (ms_A/ms_{AS}). The conditions¹ under which this ratio
is distributed as F are listed below. The last of these conditions constitutes 
the hypothesis to be tested. The first three conditions constitute the as- 
sumptions basic to the test. 


1) The experimental subjects were originally a simple random sample
from a specified population. (After the treatments have been admin-
istered, the criterion measures for each treatment group may be regarded
as a random sample from a hypothetical “treatment” population.)


1 See footnote on page 51. 


2) The treatments X subjects interaction effects are normally and inde-
pendently distributed in each treatment population. (This is necessary
when a > 2. When a = 2 one must assume that the differences in the
two criterion measures are normally distributed for all subjects in the
population.)


3) The distribution of interaction effects has the same variance in each
treatment population. (This condition is not necessary with a = 2.)


4) The means for the various treatment populations are identical. 


To show that ms_A/ms_{AS} is distributed as F, we shall employ the technique
of arithmetic corrections explained in Chapter 4, pages 110-112. Suppose 
that, for a particular experiment involving a treatments and s subjects, the 
criterion measures are entered in an a X s table, and constant arithmetic 
corrections are applied to all measures in each row, making all row (subject) 
means equal to the general mean. This, as was explained on page 110, would 
eliminate the mean square for S in the corrected table, but would leave the 
mean squares for A and AS the same as in the original table (ms'_A = ms_A,
ms'_{AS} = ms_{AS}).

Under the specified conditions, the corrected measures in the various 
columns of the once-corrected table may be regarded as constituting inde- 
pendent random samples, all drawn from the same population of similarly 
corrected measures. That is, so far as the corrected measures are concerned, 
the design is a simple-randomized design. Hence, the differences among 
column means in this table may (in consideration of Conditions 1, 2, and 3) 
be tested by means of the F ratio of the mean square for treatments to that
for within-treatments, that is, by F = ms'_A/ms'_{wA}.

The sum of squares for within-treatments (ss'_{wA}) in the once-corrected
table is the same as that for between-cells in the twice-corrected table
(pages 111-112). This follows because, with only one observation per cell, the
sum of squares for within-cells is equal to zero. The number of degrees
of freedom for within-treatments in the once-corrected table and for between-
cells in the twice-corrected table is the same [(a - 1)(s - 1)]. Hence,
ms'_{wA} = ms'_{AS} = ms_{AS}. We have previously noted that ms'_A = ms_A. Accord-
ingly, we may write


F = ms'_A/ms'_{wA} = ms_A/ms_{AS}, df = (a - 1) and (a - 1)(s - 1). (34)


We may note that when there are only two treatments (a = 2), it is neces-
sary only that the differences (D_i) between the two criterion measures for
the various subjects be normally distributed in the population (see page 
119). In this case, the F-test is equivalent to the t-test of the example on 
page 17. That is, when there are only two treatments, there is no problem 
of independence of interaction effects, or of homogeneity of interaction. 
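The test of (34), and its equivalence to the t-test for related measures when there are only two treatments, can be illustrated as follows. This is a sketch under stated assumptions: the scores are made up, and `f_ratio` and `paired_t` are illustrative helper names, not functions from the text.

```python
import math


def f_ratio(table):
    """F = ms_A / ms_AS for a subjects (rows) by treatments (columns) table."""
    s, a = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (a * s)
    col_means = [sum(row[j] for row in table) / s for j in range(a)]
    row_means = [sum(row) / a for row in table]
    ss_a = s * sum((m - grand) ** 2 for m in col_means)
    ss_s = a * sum((m - grand) ** 2 for m in row_means)
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_as = ss_total - ss_a - ss_s
    df_a, df_as = a - 1, (a - 1) * (s - 1)
    return (ss_a / df_a) / (ss_as / df_as), df_a, df_as


def paired_t(table):
    """t for the mean of the differences D_i in a two-column table."""
    d = [row[0] - row[1] for row in table]
    n = len(d)
    d_bar = sum(d) / n
    var_d = sum((x - d_bar) ** 2 for x in d) / (n - 1)
    return d_bar / math.sqrt(var_d / n)


# Hypothetical data: 5 subjects, each measured under 2 treatments.
two_treatments = [[12, 9], [15, 14], [11, 7], [18, 15], [10, 10]]
F, df_a, df_as = f_ratio(two_treatments)
t = paired_t(two_treatments)
print(F, t ** 2, (df_a, df_as))
```

With two treatments the printed F and t squared agree, and the degrees of freedom are 1 and s - 1.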




Interpretation of a Significant $F = ms_A/ms_{AS}$: If in an actual experiment
we secure a significant $F = ms_A/ms_{AS}$, we must, before rejecting the
hypothesis of equal treatment means, assure ourselves that the assumptions
underlying this test are valid.

If the experimental subjects have actually been drawn from a catalogued 
population by use of a table of random numbers, Condition 1 may be regarded 
as completely satisfied. Frequently, however, the subjects available for ex- 
perimental purposes cannot be regarded as a strictly random sample from any 
real population. In that case, we may define a hypothetical population to fit 
the sample, and then restrict our statistical inferences to this hypothetical 
population (see page 74). 

The requirement of normality (Condition 2) seems very likely to be approxi- 
mately satisfied in most psychological and educational experiments. Note 
that we do not assume that the criterion measures are normally distributed in 
the specified population, but only that the deviations of these measures, each 
from the subject’s own mean for all treatments, are normally distributed for 
each treatment. This, as we shall see shortly, comes very close to assuming 
that the errors of measurement are normally distributed — an assumption 
that we have little reason to question. 

The requirement of homogeneous variance of interaction effects (Condition 
3) in the treatments X subjects design is equivalent to the requirement of 
homogeneous within-treatments variance in the simple-randomized design. 
The variance of the interaction effects is what is left of the within-treatment
variance after the effects of subject differences have been eliminated
($ss_{AS} = ss_{wA} - ss_S$). The effect of subject differences is the same for
all treatments; hence, if the variance of the interaction effects is the same
for all treatments, the within-treatments variance must also be the same for
all treatments.

In most experimental situations, the degree to which the requirements of 
independence of interaction effects and of homogeneous interaction are met is 
within the limited control of the experimenter. The observed interaction in 
this design, as in the treatments X levels design, is generally in part intrinsic 
and in part extrinsic. The extrinsic error interaction in this case is due to 
what might be described as errors of measurement or observation. These in 
turn are due in part to the characteristics of the test or of the observational 
technique employed to secure the criterion measures, and in part to extrane- 
ous factors or to variations in the environmental conditions under which the 
test is administered. It is over these extraneous factors that the experimenter 
has some control. By employing forms of the criterion test that are as nearly 
as possible equivalent, by keeping the conditions of test administration as 
uniform as possible for all treatments, and by attempting to reduce practice 
or “carry-over” effects, the experimenter may minimize interdependence of 
interaction effects or heterogeneity of interaction. Failure to do these things, 
on the other hand, is likely to result in heterogeneous interaction. Suppose,
for example, that in an experiment involving four treatments, the $A_1$ and
$A_2$ treatments are successively administered to the subjects on one day
while $A_3$ and $A_4$ are successively administered on the next day, the
criterion measures being taken immediately following each treatment. Certain
extraneous factors (such as temperature or humidity) having variable effects
on the subjects may differ systematically from day to day. This might cause
the AS interaction for $A_1$ and $A_2$ alone to differ from that for $A_3$
and $A_4$ alone, or from that for $A_1$ and $A_3$ alone or $A_2$ and $A_4$
alone. Thus the failure to hold constant the conditions of test
administration might result in a heterogeneous interaction, which would tend
to render invalid the F-test of significance of the treatments effect.

In general, then, the deviations from exact conformity to Conditions 1 
to 3 that are likely to occur in typical educational and psychological experi- 
ments will frequently have no very serious effect on the validity of the F-test 
of the treatments effects. The general effect of minor deviations from Con- 
ditions 1 to 3 will be primarily that of making the probabilities from the 
F-table approximate rather than exact. Generally, the true probability will 
perhaps be somewhat larger than the apparent one, with the effect that the 
results may be pronounced "significant" more often than they should be. 
To the extent that these conditions are in doubt, therefore, the experimenter 
might set higher standards of significance than if they were known to be 
exactly satisfied. 

It is extremely important to note, in all applications of this F-test, that it 
does not take into consideration any systematic effects of extraneous factors 
(Type G errors) associated with the treatments. If a significant F is ob- 
tained, we are nearly always justified in concluding that the differences in 
the treatment means are too large to be due to sampling fluctuations; but 
it does not necessarily follow from this that they are due to the effects of the 
treatments themselves. It is always possible that the observed differences 
in treatment means are due primarily to extraneous factors which have been 
systematically associated with the treatments in the experiment. 


Limitations and Advantages of the Treatments X Subjects Design 


The major limitation of the treatments X subjects design is that suggested 
in the last sentence of the preceding paragraph, with special reference to an 
extraneous factor that is unique to this particular design. When a number of 
treatments are administered in succession to the same subject, the response 
of the subject to any one treatment is often conditioned by the fact that 
other treatments have previously been administered to him. The treat- 
ments X subjects design may therefore be very satisfactorily employed only 
if the treatment effects are temporary, and if the effect of each treatment may 
be assumed to have been entirely dissipated before the next treatment is 
administered. For instance, the design might be employed in an experiment 
to determine the relative effects of various drugs upon a certain "reaction 
time" only if one could assume that before each drug is administered the
cumulative effects of drugs previously administered have entirely “worn off."
Furthermore, when this design is employed, the criterion measure for any 
subject following a given treatment may be in part determined by his experi- 
ence in having been measured previously, quite apart from the effect of the 
previous treatments themselves. For instance, an experiment may be designed 
to determine the effect of styles of type on reading rate. The criterion meas- 
ure for the first style of type may be the time required to read, at a given 
minimum level of comprehension, a certain reading passage printed in a given 
style of type. Obviously, in order to measure the effects of another style of 
type on reading rate, the same subjects cannot be asked to read for a second 
time this same passage printed in the second style of type. Rather, an entirely 
different and independent reading passage must be employed to secure the 
second series of criterion measures. Even then, the subjects may be able to 
perform better on the second passage simply because they have had some 
practice in taking this kind of "test" under the experimental conditions. 
More important, the two reading-rate tests may not be equivalent in difficulty. 
If they are not equivalent, the criterion measures for the two styles of type 
might differ, not because of the effect of style of type as such, but because of 
the differences in the criterion tests. The use of the treatments X subjects 
design in such situations, therefore, frequently requires considerable prelim- 
inary work in the construction of “equivalent forms” of the criterion test. 
The cost of this must of course be reckoned in determining the efficiency of 
the design. 

It is obvious that this design can rarely be employed in learning experi- 
ments unless the interest is in the cumulative effects of the treatments, rather 
than in any comparisons of the effects of individual treatments. For instance, 
the “treatments” may really represent increasing amounts of practice in 
the same task, or increasing durations of the same experimental condition, 
and the hypothesis to be tested may be not that the “treatments” means are 
equal, but that the trend in treatment means is linear, or parabolic, etc. This 
special case of the use of the treatments X subjects design in analyses of 
trends will be considered later in Chapter 15. 

When the treatments X subjects design can be employed, it is generally 
far more precise and efficient than the simple-randomized and treatments X
levels designs, since it provides complete control of one of the most important
sources of variation in educational and psychological experiments — namely, 
differences among individual subjects (inter-subject variations). Further- 
more, the underlying assumption of normality of distribution is more likely 
to be satisfied in the treatments X subjects than in either the simple-random- 
ized or treatments X levels design. In any situation, however, the treatments 
X subjects design shares with the simple-randomized and treatments X levels 
designs an exclusive concern with Type S errors, or a failure to take cognizance 
of Type G and Type R errors. Also, it introduces certain Type G errors 
unique to this design — the effects of rank order and of sequence of treatments, 
which are often of major importance. 


Randomizing or Counter-Balancing Sequence and Order Effects 


In many experiments, each treatment is administered on a group basis to 
all subjects simultaneously, in which case, of course, the treatments are ad- 
ministered in the same rank order and in the same sequence to all subjects. 
The treatment taken last may have a decided advantage over the others, or 
may be at a disadvantage, depending on the nature of the cumulative effects 
of the preceding treatments. It should be noted that the results obtained
under any given treatment may depend not only on how many other treatments
preceded it, but on what particular treatments preceded it, and especially
on what particular treatment immediately preceded it. The results obtained
under Treatment $A_3$ preceded by Treatments $A_1$ and $A_2$, in that order,
may not be the same as those obtained under Treatment $A_3$ when it is
preceded by Treatment $A_2$ followed by Treatment $A_1$, nor may they be the
same as those obtained when Treatment $A_3$ is preceded by Treatments $A_4$
and $A_5$. In these discussions, the term “order” is merely an abbreviation
of “rank order.” That is, we will define the “order” in which a treatment is
administered as depending only upon the number of preceding treatments, but
not upon the many different sequences in which the same treatment is
administered in the same order. For instance, $A_1$-$A_2$-$A_3$ and
$A_2$-$A_1$-$A_3$ represent the same order so far as $A_3$ is concerned, but
not the same sequence.

An effort may be made to eliminate any systematic effects due either to
order or sequence by administering the treatments in different orders and 
different sequences to different subjects. One possibility is to administer the 
treatments to each subject independently in a purely random order and 
sequence (determined by the use of a table of random numbers). One treat- 
ment may then be slightly favored over another because by chance it drew a 
larger number of “favorable” orders or sequences than the other, but at any 
rate there will be no systematic bias due to order or sequence in the mean 
criterion score for any treatment. If order and sequence effects have thus 
been randomized for each subject independently, the ratio of mean squares 
for treatments and treatments X subjects is still a valid test; however, the 
experiment appears to be less precise than if the treatments are administered 
in the same order and sequence for all subjects. This is true since, in the 
former case, the order and sequence effects contribute to the error interaction, 
whereas in the latter case they contribute only to the treatments mean square. 
In the former case they are included in the error term (where they belong) 
and do not bias the treatment comparisons. Whenever it is practicable to 
do so, this randomizing procedure should be followed in a treatments X sub- 
jects experiment. 
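The randomizing procedure just described can be sketched in a few lines. The sketch below assumes a pseudo-random number generator as a stand-in for the table of random numbers mentioned in the text; the function name `random_sequences` and the `seed` parameter are illustrative only.

```python
import random


def random_sequences(treatments, n_subjects, seed=None):
    """Draw an independent random sequence of the treatments for each subject."""
    rng = random.Random(seed)  # seeded only so the draw can be reproduced
    sequences = []
    for _ in range(n_subjects):
        order = list(treatments)
        rng.shuffle(order)  # every ordering is equally likely
        sequences.append(order)
    return sequences


# Hypothetical experiment: 4 treatments, 10 subjects.
treatments = ["A1", "A2", "A3", "A4"]
plan = random_sequences(treatments, n_subjects=10, seed=1)
for subject, order in enumerate(plan, start=1):
    print(subject, "-".join(order))
```

Because each subject's sequence is drawn independently, any one treatment may by chance receive more early positions than another, but no treatment is systematically favored.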

Another way of avoiding bias due to sequence and order effects is to counter- 
balance these effects. This is done by administering the treatments in all 
possible orders and/or sequences, using an equal number of subjects with 
each order or sequence. Suppose, for instance, that there are three treatments,
$A_1$, $A_2$, and $A_3$. There are two ways of administering $A_1$ first in
order ($A_1$-$A_2$-$A_3$ and $A_1$-$A_3$-$A_2$), two of administering $A_1$
second in order ($A_2$-$A_1$-$A_3$ and $A_3$-$A_1$-$A_2$), and two of
administering $A_1$ third in order ($A_2$-$A_3$-$A_1$ and $A_3$-$A_2$-$A_1$).
That is, there are six possible combinations of order and sequence.
In an experiment involving 30 subjects, 5 may be selected at random for the 
first combination, 5 at random for the second combination, etc. 
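The complete counterbalancing of this example (three treatments, six sequences, 30 subjects, 5 per sequence) can be sketched as follows. The function name `counterbalance` and the use of integer subject labels are assumptions made for illustration; a pseudo-random shuffle stands in for selection "at random."

```python
import itertools
import random


def counterbalance(treatments, subjects, seed=None):
    """Assign equal numbers of randomly chosen subjects to every sequence."""
    sequences = list(itertools.permutations(treatments))
    if len(subjects) % len(sequences) != 0:
        raise ValueError("number of subjects must be a multiple of the "
                         "number of sequences")
    per_sequence = len(subjects) // len(sequences)
    pool = list(subjects)
    random.Random(seed).shuffle(pool)  # random selection of subjects
    return {
        seq: pool[i * per_sequence:(i + 1) * per_sequence]
        for i, seq in enumerate(sequences)
    }


# Three treatments, 30 subjects labeled 1..30 (labels are hypothetical).
plan = counterbalance(["A1", "A2", "A3"], list(range(1, 31)), seed=0)
for seq, group in plan.items():
    print("-".join(seq), sorted(group))
```

Each treatment then occupies each rank order for exactly one-third of the subjects, and each of the six sequences is used equally often.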

It would be possible also to counterbalance order only, administering
$A_1$-$A_2$-$A_3$ to one-third of the subjects, $A_3$-$A_1$-$A_2$ to another
third, and $A_2$-$A_3$-$A_1$ to the remaining third, using the same basic
sequence ($A_1$-$A_2$-$A_3$) in each case.
When the order and/or sequence effects are thus counterbalanced, the F-test 
of (34) is no longer valid. In this case, the effect of order and/or sequence 
is not left to chance, but is made exactly the same for all treatments. Never- 
theless, the resulting interaction (due to order and/or sequence) is still re- 
tained in the error term, just as it was when these effects were randomized. 
To use this F-test in this situation would be to make the basic mistake of 
eliminating a source of error from the treatment comparisons without at the 
same time eliminating it from the error term. The result, in general, would 
be to make the results of the experiment seem less significant than they 
really are, and to increase the risk of a Type II error. Appropriate pro- 
cedures for analyzing the results in counterbalanced designs will be considered 
later in Chapter 13. 
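Counterbalancing order only, with one basic sequence retained throughout, amounts to rotating the starting point of that sequence. A minimal sketch, assuming the basic sequence is supplied as a list (the function name `cyclic_orders` is illustrative):

```python
def cyclic_orders(basic_sequence):
    """Return every rotation of the basic sequence, one per starting point."""
    k = len(basic_sequence)
    return [basic_sequence[i:] + basic_sequence[:i] for i in range(k)]


orders = cyclic_orders(["A1", "A2", "A3"])
for order in orders:
    print("-".join(order))

# Each treatment should appear exactly once in each rank order.
positions = {
    t: sorted(order.index(t) for order in orders) for t in ["A1", "A2", "A3"]
}
print(positions)
```

The rotations give each treatment each rank order once, while every treatment is always immediately preceded by the same treatment (or by none), so sequence is not counterbalanced.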

It should be noted that while the methods of randomizing or counterbalancing
sequence and order effects eliminate bias of one kind, the results
would still not tell us what to expect from any treatment if it were the only 
treatment employed with a sample of subjects. We might refer to the effects 
of a treatment when it is administered first in order as the “direct” effects 
of the treatment. It should be obvious that the “direct” effect of a treat- 
ment need not be the same as the average of the effects obtained with that 
treatment in all possible orders and sequences. In comparing many types of 
treatments, therefore, the only satisfactory procedure is to use a design such 
as the simple-randomized or treatments X levels design, in which each subject 
takes only a single treatment and all effects are “direct.” 


Confounding Extraneous Factors with “Subjects” 


We noted earlier (pages 146-147) that with the treatments X levels design it 
is often desirable to confound assignable Type G errors with levels. A similar 
situation exists with the treatments X subjects design. Sometimes variations 
in a certain extraneous factor cannot be eliminated, but these unavoidable 
Type G errors may, nevertheless, be assignable to particular subjects by the 
experimenter. For example, certain subjects may be “run” on one day, 
other subjects on a second day, etc., so that “day” effects are confounded 
with “subjects.” Such confounding is clearly desirable if one can be sure 
that the extraneous factor does not interact with treatments. If the
extraneous factor does interact with treatments, such confounding may,
nevertheless, be desirable, but in this case, only if the extraneous effect
is also a random effect. “Days,” for example, may be a random effect in that the
particular days represented in the experiment may be a random sample 
from a hypothetical population of days. However, if “days” is confounded 
with “subjects” in the sense that n subjects are run on one day and another 
n subjects on another day, etc., then the design must be regarded as a random 
replications design with each day constituting a different replication (see 
Chapter 8). 

A design in which “days” has been confounded with “subjects” is diagram- 
matically represented below. The analysis appropriate for this design is 
explained in Chapter 13, pages 267-273. This analysis permits a test of the 
assumption that there is no AD interaction. If AD is significant and if days
is a random effect, then the appropriate test of the A-effect is $F = MS_A/MS_{AD}$.
If AD is non-significant and if other considerations permit one to assume 
that AD is non-existent, then an appropriate test for the A-effect is
$F = MS_A/MS_{error(w)}$, whether or not days is a random effect. If AD is
significant and D is not a random effect, one would ordinarily have relatively little
interest in the “main” effect of A, but would be concerned rather with the 
simple effects. That is, this is a situation in which the confounding of days
with subjects is not desirable. The logic of this design and of the analysis 
appropriate to it will be more fully explained in later chapters. 


[Diagram: treatments $A_1$, $A_2$, $A_3$ as columns; subjects as rows, grouped by the day on which they were run.]


Testing Differences in Individual Pairs of Treatment Means 


We have seen (page 119) that the interaction sum of squares ($ss_{RC}$) in a
two-column table is

$$ss_{RC} = \frac{n}{2} \sum_{i=1}^{r} (D_i - \bar{D})^2,$$

in which $D_i$ is the difference in cell means for Row i. From this it follows
that if $n = 1$,

$$\sum_{i=1}^{r} (D_i - \bar{D})^2 = 2\,ss_{RC}.$$

If these $D_i$'s are a random sample from a population of $D_i$'s, and if
$n = 1$, as would be the case in the treatments X subjects design, an unbiased
estimate of the variance ($\sigma_D^2$) of the population of $D_i$'s is given by

$$\text{est'd } \sigma_D^2 = \frac{\sum_{i=1}^{r} (D_i - \bar{D})^2}{r - 1} = \frac{2\,ss_{RC}}{r - 1} = 2\,ms_{RC}.$$

This is equivalent to writing

$$E(ms_{RC}) = \tfrac{1}{2}\sigma_D^2. \tag{35}$$
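The relation between the interaction sum of squares in a two-column table and the differences $D_i$ can be verified numerically. The sketch below assumes one measure per cell and made-up scores; the helper names are illustrative and are not part of the text.

```python
def interaction_ss_two_columns(pairs):
    """Interaction SS (total minus rows minus columns) for an r x 2 table, n = 1."""
    r = len(pairs)
    grand = sum(x + y for x, y in pairs) / (2 * r)
    row_means = [(x + y) / 2 for x, y in pairs]
    col_means = [sum(x for x, _ in pairs) / r, sum(y for _, y in pairs) / r]
    ss_total = sum((v - grand) ** 2 for pair in pairs for v in pair)
    ss_rows = 2 * sum((m - grand) ** 2 for m in row_means)
    ss_cols = r * sum((m - grand) ** 2 for m in col_means)
    return ss_total - ss_rows - ss_cols


def sum_sq_of_differences(pairs):
    """Sum of squared deviations of the row differences D_i from their mean."""
    d = [x - y for x, y in pairs]
    d_bar = sum(d) / len(d)
    return sum((v - d_bar) ** 2 for v in d)


# Hypothetical data: 5 rows (subjects), 2 columns (treatments).
pairs = [(12, 9), (15, 14), (11, 7), (18, 15), (10, 10)]
ss_rc = interaction_ss_two_columns(pairs)
ss_d = sum_sq_of_differences(pairs)
est_var_d = ss_d / (len(pairs) - 1)   # unbiased estimate of the variance of D
ms_rc = ss_rc / (len(pairs) - 1)      # (r - 1)(2 - 1) degrees of freedom
print(ss_d, 2 * ss_rc, est_var_d, 2 * ms_rc)
```

The sum of squares of the differences equals twice the interaction sum of squares, so the estimated variance of the differences equals twice the interaction mean square.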


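We know that the error variance of the mean ($\bar{D}$) of a random sample of
$D_i$'s is given by

$$\sigma_{\bar{D}}^2 = \frac{\sigma_D^2}{r},$$

and, from (35), that

$$\text{est'd } \sigma_D^2 = 2\,ms_{RC},$$

so that

$$\text{est'd } \sigma_{\bar{D}}^2 = \frac{2\,ms_{RC}}{r}.$$

In the notation of this chapter, $\bar{D} = M_1 - M_2$, $ms_{RC} = ms_{AS}$, and
$r = s$; hence, the preceding expression may be rewritten

$$\text{est'd } \sigma_{M_1 - M_2}^2 = \frac{2\,ms_{AS}}{s}. \tag{36}$$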


The mean square for interaction ($ms_{RC} = ms_{AS}$) in the preceding expressions,
as derived on page 119, is that computed for Columns 1 and 2 only. However,
in a multiple-column table, if the interaction is homogeneous the mean
square for interaction for different pairs of columns can differ only by chance;
hence a more stable estimate of $\sigma_{M_1 - M_2}^2$ is obtained by using in (36) the
mean square for interaction computed for the entire table, rather than from
Columns 1 and 2 alone. In this case, the degrees of freedom for the estimated
$\sigma_{M_1 - M_2}^2$ are $(s - 1)(a - 1)$.

On the assumption that the $D_i$'s are normally distributed (which would
follow if the interaction effects are normally distributed), the significance
of $M_1 - M_2$ may be tested by

$$t = \frac{M_1 - M_2}{\sqrt{\dfrac{2\,ms_{AS}}{s}}}, \qquad df = (a - 1)(s - 1).$$

Accordingly, the “critical difference” for a selected level of significance may
be computed by

$$d = t\sqrt{\frac{2\,ms_{AS}}{s}}.$$


Since, in the treatments X subjects design all treatments are based on the 
same number of measures, only a single "critical difference" need be com- 
puted and all individual differences may be classed as either significant, or 
non-significant by comparison with this critical difference, in the manner 
described on pages 93-94. 
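The computation of a single critical difference and the classification of all pairs of means against it can be sketched as follows. This assumes the tabled value of t for $(a - 1)(s - 1)$ degrees of freedom is looked up by the reader and passed in as `t_crit`; the function names and the numerical values are hypothetical.

```python
import math


def critical_difference(ms_as, s, t_crit):
    """d = t * sqrt(2 * ms_AS / s); t_crit is read from a t-table by the caller."""
    return t_crit * math.sqrt(2 * ms_as / s)


def significant_pairs(means, d):
    """Index pairs of treatments whose means differ by more than d."""
    pairs = []
    for j in range(len(means)):
        for k in range(j + 1, len(means)):
            if abs(means[j] - means[k]) > d:
                pairs.append((j, k))
    return pairs


# Hypothetical values: ms_AS = 8.0, s = 4 subjects, tabled t taken as 2.0.
d = critical_difference(ms_as=8.0, s=4, t_crit=2.0)
flagged = significant_pairs([10.0, 12.0, 15.0], d)
print(d, flagged)
```

Only one value of d is needed because every treatment mean is based on the same s measures.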


Establishing a Confidence Interval for the True Mean for Any Treatment 


The obtained mean for any treatment in a treatment X subjects experi- 
ment is an unbiased estimate of the mean that would be obtained for the 
entire population if all members of that population had been administered the 
given treatment. For the purpose of establishing a confidence interval for 
this population mean, the obtained treatment mean must be regarded as the 
mean of a simple random sample. The standard error of the mean can then 
be estimated as for any random sample and the usual procedures followed 
to establish the confidence interval for the true mean. 


The estimated error variance for a single treatment mean is thus $1/s$ times
the mean square for within-treatments computed for that particular treatment
alone.

It should be noted that the estimated error variance (est'd $\sigma_M^2$)
appropriate for establishing the confidence interval for the true mean of a single
treatment population may not be employed in a simple t-test of the significance 
of a difference between two treatment means for the same experiment, since 
these means are based on related measures. 
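The confidence interval for a single treatment mean, treating that treatment's measures as a simple random sample, can be sketched as below. The tabled t for s - 1 degrees of freedom is assumed to be supplied by the reader as `t_crit`; the function name and the scores are hypothetical.

```python
import math


def mean_confidence_interval(scores, t_crit):
    """Interval M +/- t * sqrt(ms_within / s) for one treatment's measures."""
    s = len(scores)
    m = sum(scores) / s
    ms_within = sum((x - m) ** 2 for x in scores) / (s - 1)
    standard_error = math.sqrt(ms_within / s)
    return m - t_crit * standard_error, m + t_crit * standard_error


# Hypothetical measures for one treatment; tabled t taken as 2.0 for illustration.
scores = [10, 12, 14, 16, 18]
lower, upper = mean_confidence_interval(scores, t_crit=2.0)
print(lower, upper)
```

This error variance rests on the within-treatment mean square for one treatment only; as noted above, it is not the error term for comparing two treatment means from the same subjects.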




STUDY EXERCISES ¹


1. The following data are adapted from a study reported by Chapanis,
Rouse, and Schachter ² concerning inter-sensory stimulation. One purpose of
the investigation was to compare the effects of tactile and auditory stimuli
on the performance of a visual discrimination task. The visual task was
the reading of the letters of a black-on-gray eye chart under a very faint
constant illumination. The same chart was used with each treatment. The
chart contained over 50 letters. The tactile stimulation was effected by
placing a weight on the back of the subject’s hand. The auditory stimulus
was simply a tone. The criterion measure was the number of letters correctly
identified.

The subjects were five volunteers from an introductory course in psy- 
chology. They were given a different experimental condition each day for 
five consecutive days. The order and sequence of administration of the con- 
ditions was the same for all subjects. Since the experiment was carried 
out in a darkened room, each subject was given five minutes of dark-adapta- 
tion prior to each experimental session. The room in which the experiment 
was performed was sound-treated and great care was taken to eliminate 
extraneous stimuli. 

The data are presented in the following table: 


                      Experimental Conditions (C)
         (Number of letters on eye chart correctly identified)

              I       II       III        IV         V
             Loud    Weak     Heavy     Light
Subject      Sound   Sound   Pressure  Pressure   Control   Total   T²/5

A             21      22        20        22        22       107    2290
B             22      16        23        19        23       103    2122
C             14      14        23        24        20        95    1805
D             29      24        24        24        28       129    3328
E             16      15        14        15        13        73    1066

Total        102      91       104       104       106       507
M            20.4    18.2      20.8      20.8      21.2
ΣX²          2218    1737      2230      2222      2366
T²/5         2081    1656      2163      2163      2247


¹ See second paragraph on page viii.
² A. Chapanis, R. O. Rouse, and S. Schachter, “The Effect of Inter-Sensory
Stimulation on Dark Adaptation and Night Vision,” Journal of Experimental
Psychology, vol. 39 (August, 1949), pp. 425-437.


a) Complete the necessary calculations and prepare the usual summary
table of the analysis of variance.

b) Define carefully the hypothetical parent population to which the results
of this experiment may be generalized by the logic of statistical
inference. Define also a real population to which one might wish to
generalize from this experiment on a judgmental basis. Note that in defining
the hypothetical population it is most important to specify those
characteristics in which the real and hypothetical populations differ and
in which the differences are most likely to “interact” with the
experimental treatments.

c) At the 5% level of significance, test the hypothesis that the true
“conditions” (column) means are the same. Is there any reason to believe
that the column means would tend to differ systematically even though
the same experimental condition had been employed each day (that is,
are any extraneous factors confounded with the conditions effect in
this design?) What assumption would have to be made to infer from a
significant $F = ms_C/ms_{CS}$ that the conditions have different effects?


d) The F obtained in (c) above was not significant. Explain why this
outcome does not necessarily mean that the observed differences are
due entirely to chance. Is it conceivable that real and important
treatment effects might have been cancelled by day or order and sequence
effects? Aside from this possibility, why is there a large risk of a fairly
serious Type II error in this situation?

e) Must the variance of criterion measures within each “condition
population” be the same in order to satisfy the requirements of the F-test
used in (c) above? Suppose these variances differed considerably; what
would this indicate regarding the interaction of conditions and subjects?

f) What possible a priori reasons can you suggest for expecting
heterogeneous interaction in this experiment?

g) Distinguish, in terms specific to this experiment, between an extrinsic
and intrinsic interaction. What major source of error would result
in an extrinsic interaction? Suppose an intrinsic interaction does
exist. Does this invalidate the use of the treatments X subjects mean
square as an error term? Explain.

h) What, in terms specific to this experiment, is the assumption of
normality underlying this F-test? The assumption of independent random
sampling? Why is the first of these assumptions more likely to be
satisfied than the assumption that the criterion measures are normally
distributed for the population of subjects?

i) Establish the 99% confidence interval for the true mean of the control
condition. For what purpose might one wish to establish this confidence
interval?




j) What advantage would have been gained by randomizing the order and
sequence effects for each subject independently? Had this been done,
how would the error term ($ms_{CS}$) probably have compared with that
here obtained? Why?


k) Suggest a better criterion measure for use with this experimental design, 
explaining why it is better. 


2. A study by Black ¹ was concerned with the effect of intensity of spoken
stimuli on the intensity of the listener's oral response. This matter is of 
some consequence in aircraft intercommunication since it is known that 
voice intensity contributes to intelligibility. 

Five 12-word lists, equated for intelligibility, were recorded, one at an 
intensity of —85 db (minimal for understanding), one at —65 db, one at 
—45 db, one at —25 db and one at 0 db (approaching pain level). The words 
in each list were recorded at the same constant rate and separated by five 
second intervals. 

During the experimental session, the subject, seated in an isolated room, 
listened to the recordings through headphones. Immediately upon hearing 
each stimulus word, the subject repeated it into a microphone and the in- 
tensity (in db) of this response was automatically recorded. The order 
and sequence of administration of the lists was randomized for each subject 
independently. The lists were separated by 20 second intervals. The
criterion measure for each subject for each list was the average intensity of
the 12 words repeated by him.

The subjects were 25 male college students. Five randomly selected 
subjects were used on each of five successive days. The mean (Mj) of the 
criterion measures for each treatment group and the estimated variance of 
the corresponding treatment population are given below. 


                           Intensity of Stimulus (I)

                         I          II        III         IV         V
                     (-85 db)   (-65 db)   (-45 db)   (-25 db)    (0 db)

Mean (Mj)              74.02      73.14      75.06      78.38     83.98
Estimated Population
Variance (σ²)          21.53      13.91      11.63      19.36     24.50


¹ J. W. Black, “Loudness of Speaking: The Effect of Heard Stimuli on Spoken
Responses,” Journal of Experimental Psychology, vol. 39 (June, 1949), pp. 311-315.


Summary of Analysis of Variance

Source
Intensity of Stimulus (I)
Subjects (S)
Intensity X Subjects (IS)
Total

a) Add the sums of squares for S and IS, and divide by the sum of their
degrees of freedom. Compare the result with the mean of the estimated
σ²'s given in the first of the preceding tables. Explain why this outcome
is to be expected.

b) From the data compared in (a), what would you estimate to be the error
variance of a single treatment mean in a simple-randomized experiment
concerned with the same treatments and the same parent population?
What is the ratio of this estimate to the corresponding error variance in
this treatments X subjects experiment? According to this ratio, how
many times as many subjects would have to be used in a simple-randomized
experiment to make it as precise as this treatments X subjects experiment?

c) In a simple-randomized experiment, would you expect $ms_I$ to be larger
than, equal to, or smaller than that in this treatments X subjects
experiment? Explain.

d) From (a) preceding it is apparent that the variances (σ²) of the
treatment populations contain a constant component which is due to
differences among subjects, and a possibly variable component which is due
to interaction of treatments and subjects. We may assume, in this
instance, that the between-subjects component is the larger one in each
treatment-population. This being the case, why do the data in the
first of the preceding tables suggest that there is a heterogeneous
interaction in this experiment?

e) What a priori reason can you give for expecting a heterogeneous
interaction in this experiment? (Is it reasonable to suppose that some
subjects are less "suggestible" than others, and that the replies of some
subjects will be of nearly constant intensity for all stimulus intensities,
while the replies of others will vary in intensity with the intensity of
the stimulus?)

f) Why, in spite of the likelihood of a heterogeneous interaction, can you
be practically certain that the "treatments" (or whatever extraneous
factors may be associated with them) have a real effect in this
experiment?


g) 


h) 


i) 


3 


k) 


1) 


m) 


STUDY EXERCISES 171 


Can you suggest any types of extraneous factors that might be con- 
founded with treatments (stimulus intensity) in this experiment, and 
that alone might account for the significance of F = ms;/msrs? 


Describe the operations by which the randomization of order and 
sequence for each subject independently might be accomplished in an 
experiment of this kind. 


What systematic variations in extraneous conditions from day-to-day 
could conceivably affect mss without affecting the other mean squares, 
and thus not disturb the validity of the F-test of (g)? What other 
kinds of day-to-day variations might affect the other mean squares 
(ms; and msrs)? Accordingly, what assumptions must be made about 
day effects in applying the F-test of (g)? 


Why is there usually little point in testing the hypothesis of no subject 
differences in an experiment of this kind? What extraneous factor is 
confounded with subject differences in this design? Is this desirable? 
Why? 


How could the observations be analyzed so as to yield an error term 
for testing the significance of the JS interaction? Why is there usually 
little interest in this test? 


May one conclude from the significant F of (g) that the response level 
is monotonically related to the stimulus level? 


Determine the critical difference between any two treatment means. 
What conclusions may one draw from the condition means regarding
the relationship between the criterion and the stimulus level? (Note: 
Better ways of analyzing the data for trend will be considered in Chap- 
ter 15.) 


Is there any reason to believe that if the purpose of this experiment had 
been only to test the hypothesis of no treatment effects, this purpose 
could have been served with considerably fewer subjects than were 
used? (Actually, the major purpose of the experiment was to investigate 
the trend in treatment means — and this experiment will be reconsid- 
ered with reference to this purpose in Exercise 2 of Chapter 15.) 


The Groups-Within-Treatments Design 


The Generalized Case of the Groups-Within-Treatments Design 


Sometimes the population in which the experimenter is interested consists 
of a large number of groups or subpopulations, only a relatively small num- 
ber of which may be represented in any single experiment. Sometimes, too, 
each group is such that if it is to be used at all in an experiment it must be 
used intact — that is, it may not be practicable to use only a part of the 
group. The most important instance of this kind in educational and psy- 
chological research is that in which the groups represent classes of school 
pupils, or in which the subpopulations correspond to schools. If a research 
worker wishes to experiment with school pupils, he frequently must work with 
classes already organized in the cooperating schools. To reorganize the classes 
for experimental purposes may not be administratively feasible. Further- 
more, to use only a part of the class would raise the administrative problem 
of what to do with the rest of the class in the meantime. Usually, for this 
and other reasons, it is easier to use the entire class than only a part of it. 

Experiments performed with selected groups or subpopulations fall into 
two general types. One is the type in which it is impracticable to administer 
different treatments simultaneously to different members of the same group, 
and in which each treatment is therefore administered in an independent set 
of groups. The other type is that in which it is possible to replicate the 
complete experiment — that is, to make all the treatment comparisons — for 
each of the selected groups or subpopulations independently. This chapter 
will be concerned with the first of these types. The latter type will be con- 
sidered in the succeeding chapter. 

Two illustrations of the first of these types of experiments should suffice 
at this point. Suppose that the purpose of an experiment is to determine 
which of several methods of teaching spelling in the fifth grade is most effec- 
tive. Suppose also the methods are such that if all were used simultaneously 
or successively with different classes of pupils in the same school, there would 
be serious danger that some methods would be “contaminated” by the 
others through a possible exchange of information concerning the methods
among the teachers and pupils in the same school. It may therefore be
necessary to use Method $A_1$ in one set of schools, Method $A_2$ in another and
independent set of schools, etc.

As another illustration, suppose an experiment is to be performed with 
rats, and for practical reasons it is necessary to keep the rats in cages through- 
out the experiment. Suppose, furthermore, the treatments are such that 
it is impracticable simultaneously to administer one treatment to some 
rats and other treatments to other rats in the same cage. All rats in the 
same cage must be given the same treatment, and each treatment is therefore 
administered to the rats in independent sets of cages. In such an experiment, 
systematic differences may arise between cages due to differences in the loca- 
tion of the cages in the rat room, to possible non-detected, infectious illnesses 
affecting all rats in certain cages, to the presence of “neurotic” rats in cer- 
tain cages, etc. Hence, the cage rather than the individual rat must be 
regarded as the unit of sampling in the experiment, or as a primary source 
of error variations, and the test of significance of the treatment differences 
clearly should, if possible, take such errors into consideration. 

In this chapter we shall be concerned only with the case in which each 
of the groups is of the same size, or in which it is desirable to give all the 
groups the same weight in the treatment comparisons even though they differ 
in size. In the experiment with various methods of teaching fifth-grade 
spelling, for example, the experimenter might very well wish to determine 
the relative effectiveness of the methods for the “average school” rather
than for the “average pupil.” Even though the number of pupils involved in 
his experiment may differ from school to school, he may wish to give all 
schools the same weight in computing the treatment means. That is, he may 
wish to use unweighted means of the school means as measures of effective- 
ness of the treatments, rather than the weighted means. This is especially 
desirable if there are large differences in the effectiveness of the same method 
for different schools, or if the results for any school are markedly influenced 
by extraneous factors (such as the teacher used with the method) which 
are inextricably associated with the treatments, and which may be unre- 
lated to size of school. In this case, it would seem clearly undesirable to 
give these factors, or the characteristics unique to individual schools, more 
weight in some schools than in others, simply because the number of pupils 
happened to be larger in some schools than in others. 

A simple analogy may be helpful at this point. Suppose one wished to 
describe the distribution of intelligence for a group of pupils. Suppose that 
for some pupils several independent measurements of intelligence had been 
made with different equivalent forms of the same test, or with different 
tests, and that the number of scores differed markedly from pupil to pupil.
For example, suppose that in a group of five pupils, one pupil had taken 8 
different intelligence tests, another 4 tests, and each of the others 2 tests 
each, making a total of 18 scores. In this case, it is hard to conceive of any 
practical purpose for which one would be interested in the simple mean of
the 18 scores. Rather, one would compute the mean of the available scores for
each pupil individually, and then describe the average pupil in terms of the 
simple (unweighted) mean of the five averages. 
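
The contrast between the two averages can be sketched in a few lines of Python. The scores below are invented purely for illustration; only the numbers of scores per pupil (8, 4, 2, 2, and 2) follow the example in the text.

```python
from statistics import mean

# Hypothetical intelligence-test scores; only the counts (8, 4, 2, 2, 2)
# are taken from the text, the values themselves are made up.
scores_by_pupil = {
    "pupil_1": [120, 122, 118, 121, 119, 123, 120, 117],
    "pupil_2": [100, 102, 98, 100],
    "pupil_3": [90, 94],
    "pupil_4": [85, 87],
    "pupil_5": [105, 103],
}

# Simple mean of all 18 scores: pupils with more scores carry more weight.
all_scores = [s for scores in scores_by_pupil.values() for s in scores]
pooled_mean = mean(all_scores)

# Unweighted mean of the five pupil averages: every pupil counts once.
pupil_means = [mean(scores) for scores in scores_by_pupil.values()]
mean_of_means = mean(pupil_means)

print(round(pooled_mean, 2), round(mean_of_means, 2))
```

With these made-up scores the pooled mean is pulled toward the pupil who happened to take eight tests, which is the distortion the unweighted mean avoids.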

The situation is much the same in experiments designed to determine which 
of several methods of instruction should be recommended for schools in gen- 
eral. The school is the unit of sampling. The interest is in what is true of 
the treatments in the average school, or in the average class. In the case in 
which the various groups are organized by the experimenter rather than 
used by him as they are found, it is usually desirable in any event to make 
all groups of the same size; hence in such situations the problem of varying 
groups is of minor importance.

When all groups are the same size, it matters little whether the group
or the subject is regarded as the unit of analysis, since in this case the weighted
and unweighted means of the group means are the same. There are some
advantages, however, in regarding the subject as the unit of analysis.
Accordingly, both possible analytical procedures will be presented.


The Analysis of Variance in Groups-Within-Treatments Designs (The Subject as the Unit of Analysis)


We shall first see how the total sum of squares may be analyzed into its 
components. Since it is almost as easy to do this in the general case of vari- 
able groups as in the case of constant n’s, we shall not at this point impose 
the restriction of uniform size of groups. 

In any experiment of the groups-within-treatments type, the total experimental
sample may be regarded as consisting of a number ($a$) of sets of groups
(corresponding to treatments), or of a total of $k$ groups. Using the methods
of simple analysis of variance into two components, we may, by disregarding
the groups, analyze the total sum of squares ($ss_T$) into its between-treatments
($ss_A$) and within-treatments ($ss_{wA}$) components:

$$ss_T = ss_A + ss_{wA}. \tag{37}$$


Similarly, by disregarding the sets (treatments) and considering only the 
groups, the total sum of squares may also be analyzed into its between-groups 
and within-groups components, as follows: 


$$ss_T = ss_G + ss_{wG}. \tag{38}$$


Also, within any given treatment set, say Treatment $j$, the total sum of
squares ($ss_{T_j}$) for that set alone may be analyzed into its between-groups
($ss_{G_j}$) and within-groups ($ss_{wG_j}$) components:

$$ss_{T_j} = ss_{G_j} + ss_{wG_j}.$$


Summing such expressions for all $a$ sets or treatments,

$$\sum_{j=1}^{a} ss_{T_j} = \sum_{j=1}^{a} ss_{G_j} + \sum_{j=1}^{a} ss_{wG_j},$$

which, if we let $ss_{GwA} = \sum_{j=1}^{a} ss_{G_j}$ and $ss_{wG} = \sum_{j=1}^{a} ss_{wG_j}$, may be written

$$ss_{wA} = ss_{GwA} + ss_{wG}. \tag{39}$$

Hence, from (37) and (39),

$$ss_T = ss_A + ss_{GwA} + ss_{wG}. \tag{40}$$


To compute the sum of squares for groups-within-treatments ($ss_{GwA}$), we note
from (38) and (40) that

$$ss_G = ss_A + ss_{GwA},$$

from which

$$ss_{GwA} = ss_G - ss_A. \tag{41}$$


The same result as in (40) may be obtained by algebraic methods. For
this purpose, we will let

$X$ = any measure
$a$ = number of treatments
$k$ = total number of groups (within treatments)
$M$ = general mean
$M_j$ = mean of the $j$th set or treatment
$M_{ij}$ = mean of the $i$th group in set $j$
$n_{ij}$ = number of cases in the $i$th group in set $j$
$n_j$ = number of cases in set $j$
$u_j$ = number of groups in set $j$
$N$ = total number of cases
$c = (X - M_{ij})$
$d = (M_{ij} - M_j)$
$e = (M_j - M)$.


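We may then write as an identity,

$$(X - M) = c + d + e.$$

Squaring both sides, we get

$$(X - M)^2 = c^2 + d^2 + e^2 + 2cd + 2ce + 2de.$$

Summing for the $i$th group for treatment $j$ (letting $\sum$ indicate summation over the $n_{ij}$ cases in that group),

$$\sum (X - M)^2 = \sum c^2 + n_{ij}d^2 + n_{ij}e^2 + 2d\sum c + 2e\sum c + 2n_{ij}de.$$

Summing for all $u_j$ groups in set $j$ and noting that $\sum c = 0$,

$$\sum_{i=1}^{u_j}\sum (X - M)^2 = \sum_{i=1}^{u_j}\sum c^2 + \sum_{i=1}^{u_j} n_{ij}d^2 + n_j e^2 + 2e\sum_{i=1}^{u_j} n_{ij}d.$$

Summing for all $a$ sets and noting that $\sum_{i=1}^{u_j} n_{ij}d = 0$,

$$\sum_{j=1}^{a}\sum_{i=1}^{u_j}\sum (X - M)^2 = \sum_{j=1}^{a}\sum_{i=1}^{u_j}\sum c^2 + \sum_{j=1}^{a}\sum_{i=1}^{u_j} n_{ij}d^2 + \sum_{j=1}^{a} n_j e^2,$$

or

$$\sum_{j=1}^{a}\sum_{i=1}^{u_j}\sum (X - M)^2 = \sum_{j=1}^{a}\sum_{i=1}^{u_j}\sum (X - M_{ij})^2 + \sum_{j=1}^{a}\sum_{i=1}^{u_j} n_{ij}(M_{ij} - M_j)^2 + \sum_{j=1}^{a} n_j(M_j - M)^2,$$

or

$$ss_T = ss_{wG} + ss_{GwA} + ss_A.$$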


The number of degrees of freedom for $ss_T$ is, of course, $N - 1$. The number
of degrees of freedom for $ss_{wG}$ is $\sum_{j=1}^{a}\sum_{i=1}^{u_j}(n_{ij} - 1) = (N - k)$. The number of
degrees of freedom for $ss_{GwA}$ is $\sum_{j=1}^{a}(u_j - 1) = (k - a)$. The number of degrees
of freedom for $ss_A$ is, of course, $(a - 1)$. The analysis of the total degrees of
freedom is therefore

$$df_T = df_{wG} + df_{GwA} + df_A$$

or

$$(N - 1) = (N - k) + (k - a) + (a - 1).$$
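
The partition of the total sum of squares and of its degrees of freedom can be checked numerically. The following Python sketch uses a small invented data set with unequal group sizes; the treatment labels and scores are assumptions made for illustration only, and each component is computed directly from its definition as a sum of squared deviations.

```python
# Hypothetical data: treatment -> list of groups -> list of scores.
data = {
    "A1": [[3, 5, 4], [6, 8, 7, 7]],
    "A2": [[9, 7, 8], [10, 12], [8, 9, 10]],
}

def mean(xs):
    return sum(xs) / len(xs)

all_x = [x for groups in data.values() for g in groups for x in g]
M = mean(all_x)                                    # general mean
N = len(all_x)                                     # total number of cases
a = len(data)                                      # number of treatments
k = sum(len(groups) for groups in data.values())   # total number of groups

ss_T = sum((x - M) ** 2 for x in all_x)
ss_wG = ss_GwA = ss_A = 0.0
for groups in data.values():
    M_j = mean([x for g in groups for x in g])     # treatment mean
    n_j = sum(len(g) for g in groups)              # cases in this treatment
    ss_A += n_j * (M_j - M) ** 2
    for g in groups:
        M_ij = mean(g)                             # group mean
        ss_GwA += len(g) * (M_ij - M_j) ** 2
        ss_wG += sum((x - M_ij) ** 2 for x in g)

df = {"T": N - 1, "wG": N - k, "GwA": k - a, "A": a - 1}
print(ss_T, ss_wG + ss_GwA + ss_A, df)
```

The two printed sums of squares agree up to rounding error, and the degrees of freedom add in the same way.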


Computational Procedures (Subject the Unit of Analysis) 


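The procedures for computing the various components of the total sum of
squares are the same as those employed in simple analysis of variance.

$$ss_T = \sum_{j=1}^{a}\sum_{i=1}^{u_j}\sum X^2 - T^2/N$$

$$ss_A = \sum_{j=1}^{a} T_j^2/n_j - T^2/N$$

$$ss_G = \sum_{j=1}^{a}\sum_{i=1}^{u_j} T_{ij}^2/n_{ij} - T^2/N$$

$$ss_{wG} = ss_T - ss_G = \sum_{j=1}^{a}\sum_{i=1}^{u_j}\sum X^2 - \sum_{j=1}^{a}\sum_{i=1}^{u_j} T_{ij}^2/n_{ij}$$

$$ss_{GwA} = ss_G - ss_A = \sum_{j=1}^{a}\sum_{i=1}^{u_j} T_{ij}^2/n_{ij} - \sum_{j=1}^{a} T_j^2/n_j$$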

Thus, the computational results may be summarized in a table as follows:

Source of Variation         df       Sum of Squares                                                      Mean Square

Treatments                  a - 1    $ss_A = \sum_j T_j^2/n_j - T^2/N$                                   $ms_A = ss_A/(a - 1)$
Groups-within-Treatments    k - a    $ss_{GwA} = \sum_j\sum_i T_{ij}^2/n_{ij} - \sum_j T_j^2/n_j$        $ms_{GwA} = ss_{GwA}/(k - a)$
Subjects (within-groups)    N - k    $ss_{wG} = \sum_j\sum_i\sum X^2 - \sum_j\sum_i T_{ij}^2/n_{ij}$     $ms_{wG} = ss_{wG}/(N - k)$
Total                       N - 1    $ss_T = \sum_j\sum_i\sum X^2 - T^2/N$
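
As a rough illustration of these computational formulas, the Python sketch below works from totals rather than deviations. The data are invented, and the names `T`, `T_j`, and `T_ij` are taken here to denote the grand total, the treatment totals, and the group totals, which is what the formulas require.

```python
# Hypothetical scores: two treatments, two groups per treatment, two subjects per group.
data = {
    "A1": [[2, 4], [6, 8]],
    "A2": [[10, 12], [14, 16]],
}

all_x = [x for groups in data.values() for g in groups for x in g]
N = len(all_x)
a = len(data)
k = sum(len(groups) for groups in data.values())
T = sum(all_x)                                   # grand total
sum_x2 = sum(x ** 2 for x in all_x)

# Treatment totals T_j with n_j, and group totals T_ij with n_ij.
treat = [(sum(x for g in groups for x in g), sum(len(g) for g in groups))
         for groups in data.values()]
group = [(sum(g), len(g)) for groups in data.values() for g in groups]

ss_T = sum_x2 - T ** 2 / N
ss_A = sum(t ** 2 / n for t, n in treat) - T ** 2 / N
ss_G = sum(t ** 2 / n for t, n in group) - T ** 2 / N
ss_wG = ss_T - ss_G
ss_GwA = ss_G - ss_A

ms_A = ss_A / (a - 1)
ms_GwA = ss_GwA / (k - a)
ms_wG = ss_wG / (N - k)
print(ss_A, ss_GwA, ss_wG, ss_T)
print(ms_A, ms_GwA, ms_wG)
```

For these invented scores the three components are 128, 32, and 8, which sum to the total of 168.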


Analysis of Unweighted Group Means (The Group as the Unit of Analysis) 


When the number of cases varies from group to group, but when it is never- 
theless desirable to give each group the same weight in the treatment com- 
parisons, the analysis is based only on the means of the various groups, all 
the means being treated alike regardless of differences in the numbers on 
which they are based. Each treatment mean is then the unweighted mean of 
a simple random sample of group means. In this case, therefore, the groups- 
within-treatments design reduces to a simple-randomized design. We will 
identify with an asterisk (*) any sum of squares based on unweighted group 
means. Accordingly, the total sum of squares ($ss^*_T$), i.e., the sum of squared
deviations of the group means from the general mean (for which the
$df = k - 1$), is analyzed into two components, the (between) treatments
($ss^*_A$) and groups-within-treatments ($ss^*_{GwA}$) components. The computational
results may be summarized in a table as follows:


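Source of Variation         df       Sum of Squares                                                                      Mean Square

Treatments                  a - 1    $ss^*_A = \sum_{j=1}^{a}\left(\sum_{i=1}^{u_j} M_{ij}\right)^2/u_j - T^2/k$          $ms^*_A = ss^*_A/(a - 1)$
Groups-within-Treatments    k - a    $ss^*_{GwA} = \sum_j\sum_i M_{ij}^2 - \sum_j\left(\sum_i M_{ij}\right)^2/u_j$        $ms^*_{GwA} = ss^*_{GwA}/(k - a)$
Total                       k - 1    $ss^*_T = \sum_{j=1}^{a}\sum_{i=1}^{u_j} M_{ij}^2 - T^2/k$

in which $T = \sum_{j=1}^{a}\sum_{i=1}^{u_j} M_{ij}$.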


It should be noted that a given sum of squares based on the unweighted
group means is not the same as the corresponding sum of squares based
on the individual measures. If the groups are of the same size ($n$), then $ss_A$
and $ss_{GwA}$ derived from the individual measures are each $n$ times the corresponding
sum of squares based on the unweighted means. That is, $ss_A = n(ss^*_A)$
and $ss_{GwA} = n(ss^*_{GwA})$. The proof of this is left as an exercise for the
student. If the groups vary in size, the conversion factor is the harmonic
mean of the group n's, i.e., $ss_A = \tilde{n}(ss^*_A)$, etc., $\tilde{n}$ being estimated by formula
(49) on page 184 when the number of groups is large.
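
The relation for groups of uniform size can be checked on a small example. The Python sketch below uses invented scores with every group of size $n = 2$; it is a numerical check under that assumption, not a substitute for the proof left to the student.

```python
# Hypothetical scores: two treatments, two groups each, every group of size n = 2.
data = {
    "A1": [[2, 4], [6, 8]],
    "A2": [[10, 12], [14, 16]],
}
n = 2

def mean(xs):
    return sum(xs) / len(xs)

# Sums of squares derived from the individual measures.
all_x = [x for groups in data.values() for g in groups for x in g]
M = mean(all_x)
ss_A = ss_GwA = 0.0
for groups in data.values():
    M_j = mean([x for g in groups for x in g])
    ss_A += sum(len(g) for g in groups) * (M_j - M) ** 2
    ss_GwA += sum(len(g) * (mean(g) - M_j) ** 2 for g in groups)

# Sums of squares derived from the unweighted group means (starred quantities).
group_means = {t: [mean(g) for g in groups] for t, groups in data.items()}
M_star = mean([m for means in group_means.values() for m in means])
ss_A_star = ss_GwA_star = 0.0
for means in group_means.values():
    M_j_star = mean(means)
    ss_A_star += len(means) * (M_j_star - M_star) ** 2
    ss_GwA_star += sum((m - M_j_star) ** 2 for m in means)

print(ss_A, n * ss_A_star)        # the two values agree
print(ss_GwA, n * ss_GwA_star)    # the two values agree
```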


Test of the Hypothesis of Equal Treatment Means (Unweighted) 


We have already seen that when the analysis is based on the group means, 
that is, when the treatment comparisons are based on the unweighted means 
of the group means, the groups-within-treatments design reduces to a simple- 
randomized design. Accordingly (see page 55), the hypothesis of equal 
treatment means is tested by 


$$F = ms^*_A/ms^*_{GwA}, \qquad df = (a - 1) \text{ and } (k - a).$$

We have also seen that when the groups are of uniform size, $ss_A = n(ss^*_A)$
and $ss_{GwA} = n(ss^*_{GwA})$. Accordingly,

$$F = \frac{ms_A}{ms_{GwA}} = \frac{ms^*_A}{ms^*_{GwA}}. \tag{42}$$


That is, the test of the hypothesis of equal treatment means may be based 
on the ratio of mean squares for treatments and groups-within-treatments, 
whether these mean squares are derived from the analysis of unweighted 
means or from the analysis of individual measures. 

If the number of cases differs from group to group, $ms_A/ms_{GwA}$ is not
distributed as F. In this case, if one wishes to test the hypothesis that the
weighted mean of the group means for the entire population is the same for 
all treatments, one must use other procedures that cannot be considered 
here. 
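
The test based on unweighted group means can be sketched as follows. The class means are invented for illustration (three methods, three classes each, with class sizes deliberately ignored), and the sums of squares are obtained by treating the class means as the observations in a simple-randomized design.

```python
# Hypothetical class means for three teaching methods; class sizes are ignored.
class_means = {
    "A1": [52, 55, 49],
    "A2": [58, 61, 58],
    "A3": [50, 54, 52],
}

a = len(class_means)                                  # number of treatments
k = sum(len(v) for v in class_means.values())         # total number of groups
T = sum(sum(v) for v in class_means.values())         # sum of all group means

ss_T_star = sum(m ** 2 for v in class_means.values() for m in v) - T ** 2 / k
ss_A_star = sum(sum(v) ** 2 / len(v) for v in class_means.values()) - T ** 2 / k
ss_GwA_star = ss_T_star - ss_A_star

ms_A_star = ss_A_star / (a - 1)
ms_GwA_star = ss_GwA_star / (k - a)
F = ms_A_star / ms_GwA_star
df = (a - 1, k - a)
print(F, df)
```

The resulting ratio would then be referred to the F-table with the printed degrees of freedom.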

Interpretation of a Significant $F = ms^*_A/ms^*_{GwA}$: The conditions under which
$ms^*_A/ms^*_{GwA}$ is distributed as F have been given on page 73. Translated to
the groups-within-treatments application, these conditions¹ are as follows:


1) From a population consisting of a very large number of groups, a set
of groups has been selected strictly at random for each treatment
independently. After administration of the treatments, the groups that received
each treatment are regarded as a random sample from a hypothetical
treatment population.


2) For all groups in each treatment population, the group means are
normally distributed.


3) The variance of this distribution of group means is the same for each
treatment population.


4) The (unweighted) mean of the group means is the same for each treat-
ment population.


¹ See footnote on page 51.


Before he may reject the hypothesis of equal treatment means (Condition 4) 
on the basis of a significant F, the experimenter must satisfy himself that 
Conditions 1-3 have been met —at least closely enough that the sampling 
distribution of the mean square ratio will not be appreciably affected by 
the failure to satisfy them exactly. 

The first condition, that of random selection of groups, is one which often 
cannot definitely be shown to have been satisfied. This is particularly 
true in educational experiments in which the groups correspond to classes 
or schools. In such cases, the experimenter would usually like to test some
hypothesis about a real and specified population of schools, but very seldom 
is he able to select the experimental schools strictly at random from that 
population. Ordinarily, he must solicit the necessary cooperation on a per- 
sonal basis, or on the basis of institutional relationships. For example, a 
research student may induce some personal friends who are school superin- 
tendents or principals to give him the necessary experimental facilities in 
their schools as a personal favor; or a research professor in a university 
may seek cooperation among schools known to have a sympathetic attitude 
toward research in general, or to his university in particular. In such situ- 
ations, there may frequently be quite marked differences, on the average, 
between the schools accessible to the experimenter and those not accessible. 
The experimenter may be able to make a random selection from the list of 
accessible schools, but the sample is, of course, still a biased sample so far 
as the whole population is concerned. He may be able to reduce much of 
this bias by selecting from the accessible schools a sample that, with refer- 
ence to certain control variables such as size and type of school, geographical 
location, annual per pupil expenditures, etc., is approximately representative 
of the whole population in which he is interested. If the group of schools
used with each treatment is thus made representative for each treatment 
independently, Condition 1 will clearly be violated, even though the bias 
may have been reduced. To avoid such bias, and at the same time to satisfy 
Condition 1, the experimenter may make his total experimental sample 
representative of the whole population as indicated above, but then assign 
these selected schools at random to the various treatments. He could then
resort to the device of defining a hypothetical population that will fit his
sample, the population being defined roughly as schools of the type accessible 
for experimental work in general, but otherwise representative in certain 
respects of schools in general. The schools assigned to the various treat- 
ments may then be regarded as random samples from this hypothetical 
population. The experimenter should consequently restrict any statistical 
inferences drawn from the experiment to this hypothetical population. 
If any further inferences are drawn concerning the real population, these 
will be drawn without the usual safeguards of the logic of statistical infer- 
ence, and must rest on the subjective judgment and experience of the indi- 
vidual drawing the inferences. 
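
The random assignment of the selected schools to treatments might be carried out along the following lines. This is only a sketch: the school labels, the treatment labels, and the seed are hypothetical, and the helper `assign_groups_at_random` is introduced here for illustration.

```python
import random

def assign_groups_at_random(groups, treatments, seed=None):
    """Shuffle the groups, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(groups)
    rng.shuffle(shuffled)
    assignment = {t: [] for t in treatments}
    for position, group in enumerate(shuffled):
        assignment[treatments[position % len(treatments)]].append(group)
    return assignment

# Twelve hypothetical schools already chosen to be representative of the population.
schools = [f"school_{i:02d}" for i in range(1, 13)]
plan = assign_groups_at_random(schools, ["A1", "A2", "A3"], seed=1953)

for treatment, assigned in plan.items():
    print(treatment, sorted(assigned))
```

Because the representative sample is fixed first and only the allocation is random, the schools under each treatment can be regarded as random samples from the same hypothetical population.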

In most educational experiments of the type just considered, the mean 
obtained for any given treatment in the experiment must be regarded as a 
decidedly biased estimate of the corresponding mean for the real population, 
and any estimate of the standard error of the obtained mean will be a dubious 
basis for testing any hypothesis concerning the (real) population mean for 
that treatment. Fortunately, however, in most experiments we are not so 
much interested in estimating the population mean for the given treatment, 
as in estimating the rank order of the treatments on the basis of their effective- 
ness for the whole (real) population. For this purpose, it does not matter 
if all obtained treatment means are biased, so long as all are equally biased 
in the same direction. In other words, what really matters is whether or not 
there is any interaction between treatments and the differences between the
real and hypothetical populations, or between treatments and the differences 
in accessible and non-accessible schools. All the treatments may do better 
with the accessible schools than they would with non-accessible schools, but 
there may be no reason to suppose that any one treatment will do relatively 
better than any other for either group of schools. That is, if the null hypothe- 
sis may be retained for the hypothetical population, one might reasonably 
contend that it may also be retained for the real population. The crucial 
question, then, is whether or not any of the possible differences between 
the selected schools and those not selected are likely to affect the responses 
to some of the treatments more than to others. If not, Condition 1 need 
not cause serious concern. 

In certain other types of experiments, Condition 1 may often be regarded, 
for all practical purposes, as completely satisfied. In the illustrative experi- 
ment with rats (page 173), for example, the rats may originally have been 
drawn at random from a homogeneous and well-defined stock of rats, and 
may have been assigned to the cages on a strictly random basis. The various 
sources of systematic differences between cages (Type G errors), such as 
positions of cages in the rat room, could then be strictly randomized with
reference to cages and treatments. The particular cages under each treat- 
ment could then be fairly regarded as strictly a random sample from a hypo- 
thetical population of cages — consisting of an indefinitely large number of 
rats from the given stock thus assigned to cages and given the specified
treatment. In such an experiment, a significant F could hardly be attributed
to any failure to satisfy Condition 1 in the experiment. 

So far as Conditions 2 and 3 (normality and homogeneity of variance of 
group means) are concerned, the considerations are almost exactly the same as 
those reviewed on pages 73-78. In most educational and psychological ex- 
periments of this type, the distribution of group means will at least roughly 
approximate the normal distribution. Results from wide-scale testing pro- 
grams have shown repeatedly that the distributions of mean scores (by 
schools) on psychological and educational tests usually show a roughly bell- 
shaped distribution for large numbers of schools. Furthermore, general ex- 
perience indicates that distributions of randomized errors, such as those exem- 
plified in the rat experiment, are typically distributed in an approximately 
normal form. Finally, means of random samples almost always tend to be 
more nearly normally distributed than the individual measures on which they 
are based. Thus, Condition 2 (normality of group means) could be closely
satisfied even though the individual measures in each group show markedly 
skewed or otherwise non-normal distributions. 

Whether or not Condition 3 (homogeneity of variance of group means) is 
satisfied depends on the relative effects of each treatment upon the mean 
and variance of the distribution of criterion measures. As was suggested 
on pages 76-77, the induced differences in the treatment means in many experiments
are small when measured in terms of the $\sigma$ of the distribution. In
such cases it is reasonable to suppose that since the treatments did not result 
in marked changes in the population means, neither could they bring about 
any marked differences in the variances of the group means. Condition 3, 
however, should never be taken for granted. Sometimes the criterion meas- 
ures are such that before administration of the treatments, or at the begin- 
ning of the experiment, both the mean and variance of the group means 
of the criterion variable are zero or near zero. This might happen in a learn- 
ing experiment in which the “treatments” really represent different durations 
of the same treatment, or different amounts of practice with the same pro- 
cedure, or different numbers of applications of the same treatment, and in 
which the criterion measure is gain or improvement on a criterion test. In 
this case, there is likely to be a marked correlation between the means and 
variances of the group means for the various treatments, and the assumption 
of homogeneity of variance may be far from satisfied. 

In the general situation, then, Conditions 1, 2, and 3 are likely to be only 
approximately satisfied — how closely the experimenter may be unable to 
say with much accuracy. The more extreme the departure from these con- 
ditions, of course, the less accurate, and possibly the more biased, are the 
probabilities read from the F-table. The minimum level at which the obtained 
F must be "significant" before the null hypothesis concerning treatment 
means is rejected should therefore depend on the experimenter's judgment 
as to the extent to which these conditions have been satisfied. The less his 
confidence in these conditions, the higher should be the level of significance
demanded. Unfortunately, no clear-cut rules can be offered to guide the
experimenter in his selection of this level of significance, but considerable 
help should be secured from the Norton study (see pages 78-86). 


The Groups Considered as Random Samples from Corresponding (Hypothetical) Subpopulations


Thus far in this chapter we have regarded the populations in which we 
are interested as consisting of a large number of groups. While each group 
is finite (and usually small), the population has been considered as consist- 
ing of so large a number of groups that, for purposes of error analysis, it may 
be regarded as of infinite size. The sampling has been by groups rather
than by individuals, each group having been used intact. That is, we have 
not regarded the group itself as a random sample, and have made no assump- 
tions about the form or variance of the distribution of criterion measures for 
the individual groups. 

In many practical applications of the groups-within-treatments design, 
however, it is possible and plausible to regard each group as itself a random 
sample from a corresponding subpopulation. For example, the pupils now 
enrolled in the sixth grade in a particular Iowa public school might be re- 
garded as a random sample from a hypothetical subpopulation corresponding 
to this school, the subpopulation consisting of an indefinite number of “similar”
pupils who might pass through the sixth grade in this particular school
under essentially the same (stable) conditions. If the pupils in each school 
are thus regarded as a random sample from a corresponding hypothetical 
subpopulation, then the “population” of pupils now enrolled in the sixth 
grades in all Iowa public schools would itself become a representative sample 
from a hypothetical population comprised of the various hypothetical sub- 
populations corresponding to the schools. 

As we shall see, the test of significance (42) already described (page 178) 
remains valid whether we regard the entire population as consisting of small 
and finite groups (which are drawn intact) or as consisting of a large number 
of hypothetical and infinite subpopulations. The latter way of viewing 
the population, however, does present certain advantages which we shall 
consider in the following sections. 


The Expected Values of $ms_A$ and $ms_{GwA}$


Let us first consider the case in which $n_{ij} = n$ is a constant. For any
given one of the treatments, say Treatment $j$, the obtained mean ($M_{ij}$) of any
group may be regarded as consisting of two parts: the true mean ($\mu_{ij}$) of
the corresponding subpopulation, and an error ($M_{ij} - \mu_{ij}$) due to random
sampling from the subpopulation. Thus, for Subpopulation $i$ in Treatment $j$,

$$M_{ij} = \mu_{ij} + (M_{ij} - \mu_{ij}).$$

Subtracting $\mu_j$, the true mean for Treatment $j$, from both sides of this
expression, we get

$$(M_{ij} - \mu_j) = (\mu_{ij} - \mu_j) + (M_{ij} - \mu_{ij}).$$

Squaring both sides, summing for all values of $i$, and dividing by $u_j$ (the
number of groups in set $j$), we get

$$\frac{\sum_i (M_{ij} - \mu_j)^2}{u_j} = \frac{\sum_i (\mu_{ij} - \mu_j)^2}{u_j} + \frac{2\sum_i (\mu_{ij} - \mu_j)(M_{ij} - \mu_{ij})}{u_j} + \frac{\sum_i (M_{ij} - \mu_{ij})^2}{u_j}.$$

Letting $u_j$ approach infinity, we note that the left-hand term above becomes
the true variance of the group means for Treatment $j$, which we will represent
by $\sigma^2_{M_j}$. The first right-hand term becomes the true variance of the
subpopulation means for Treatment $j$, which we will denote by $\sigma^2_{G_j}$. The second
right-hand term becomes zero, since there can be no correlation between
$(\mu_{ij} - \mu_j)$ and $(M_{ij} - \mu_{ij})$. Finally, the last term becomes the error variance
of a single group mean, which, if all subpopulations have the same variance
($\sigma^2$), becomes equal to $\sigma^2/n$. Accordingly, the preceding expression may
be written

$$\sigma^2_{M_j} = \sigma^2_{G_j} + \sigma^2/n. \tag{43}$$


Let us now assume that $\sigma^2_{M_j}$ is the same for all treatments, and that $\sigma^2_{G_j}$ is
also constant. We will let $\sigma^2_M$ and $\sigma^2_G$ represent these common variances for
the treatment populations. It then follows from (43) that

$$\sigma^2_M = \sigma^2_G + \sigma^2/n. \tag{44}$$

If $n$ were variable, the last term of (44) would be $\sigma^2/\tilde{n}_{ij}$, in which $\tilde{n}_{ij}$ is the
harmonic mean of all the group n's.
“i 


Now we know that in any experiment, $\sum_{i=1}^{u_j}(M_{ij} - M_j)^2/(u_j - 1)$, for treatment
$A_j$ alone, is an unbiased estimate of $\sigma^2_M$, or that $\sigma^2_M$ is its expected value.
Hence, when $n_{ij} = n$ is a constant,

$$\frac{\sum_{i=1}^{u_j} n_{ij}(M_{ij} - M_j)^2}{u_j - 1}$$

is an unbiased estimate of $n\sigma^2_M$. That is,

$$E(ms_{GwA}) = n\sigma^2_M = n\sigma^2_G + \sigma^2 \;(= \tilde{n}_{ij}\sigma^2_G + \sigma^2). \tag{45}$$


Thus we see that the “error” term ($ms_{GwA}$) used in the test of significance
(42) suggested on page 178 takes into consideration two types of sources of
error: differences among groups or subpopulations as measured by $\sigma^2_G$, and
differences among individual subjects as measured by $\sigma^2$.
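
The expected value in (45) can be illustrated by a small simulation. The Python sketch below is not part of the derivation; it assumes normally distributed subpopulation means and subject scores, uses arbitrary values ($\sigma_G = 2$, $\sigma = 3$, $n = 5$), and simply averages the groups-within-treatments mean square over many simulated experiments.

```python
import random

def simulate_ms_GwA(a, u, n, sigma_G, sigma, reps, seed=0):
    """Average the groups-within-treatments mean square over `reps` experiments."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        ss_GwA = 0.0
        for _j in range(a):
            group_means = []
            for _i in range(u):
                mu_ij = rng.gauss(0.0, sigma_G)      # true subpopulation mean
                xs = [rng.gauss(mu_ij, sigma) for _ in range(n)]
                group_means.append(sum(xs) / n)      # obtained group mean
            M_j = sum(group_means) / u
            ss_GwA += sum(n * (m - M_j) ** 2 for m in group_means)
        total += ss_GwA / (a * u - a)                # df = k - a
    return total / reps

average_ms = simulate_ms_GwA(a=3, u=5, n=5, sigma_G=2.0, sigma=3.0, reps=2000)
expected = 5 * 2.0 ** 2 + 3.0 ** 2                   # n*sigma_G^2 + sigma^2 = 29
print(average_ms, expected)
```

The simulated average should fall close to 29, the value given by (45) for these settings, with the usual sampling fluctuation.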


We already know (see page 60) that $ms_{wG}$ provides us with an unbiased
estimate of $\sigma^2$. That is,

$$E(ms_{wG}) = \sigma^2. \tag{46}$$


We are thus provided from (45) and (46) with a way of estimating $\sigma^2_G$ from the
results for a single experiment, as follows:

$$\text{est'd } \sigma^2_G = \frac{ms_{GwA} - ms_{wG}}{n}. \tag{47}$$


This expression has been obtained for the case in which $n_{ij} = n$ is constant
for all groups. When $n_{ij}$ is variable, an estimate of $\sigma^2_G$ is given by

$$\text{est'd } \sigma^2_G = \frac{ms_{GwA} - ms_{wG}}{\tilde{n}_{ij}}, \tag{48}$$

in which

$$\tilde{n}_{ij} = \frac{k}{\sum_{j=1}^{a}\sum_{i=1}^{u_j} (1/n_{ij})}, \tag{49}$$

$k$ being the total number of groups used with all treatments ($k = \sum_{j=1}^{a} u_j$).
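
A short Python sketch of the estimate follows. It takes the divisor to be the ordinary harmonic mean of the group sizes, as the text describes it, so that the constant-$n$ case is recovered when all sizes are equal; the mean squares and group sizes used are hypothetical.

```python
def harmonic_mean(ns):
    """Harmonic mean of the group sizes: k divided by the sum of reciprocals."""
    return len(ns) / sum(1.0 / n for n in ns)

def estimate_sigma2_G(ms_GwA, ms_wG, group_sizes):
    """Estimate the variance of subpopulation means from the two mean squares."""
    return (ms_GwA - ms_wG) / harmonic_mean(group_sizes)

# Constant group size: the harmonic mean is just n.
est_equal = estimate_sigma2_G(16.0, 2.0, [2, 2, 2, 2])

# Variable group sizes (hypothetical): the harmonic mean of 10, 20, 20, 40, 40 is 20.
est_unequal = estimate_sigma2_G(16.0, 2.0, [10, 20, 20, 40, 40])

print(est_equal, est_unequal)
```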
In a manner similar to that employed in the preceding proof, it may be
shown (proof will not be given here) that

$$E(ms_A) = \sigma^2 + \tilde{n}_{ij}\sigma^2_G + \tilde{n}_j\,\frac{\sum_{j=1}^{a}(\mu_j - \mu)^2}{a - 1}, \tag{50}$$

in which $(\mu_j - \mu)$ represents the deviation of the true mean ($\mu_j$) of Treatment
$A_j$ from the mean ($\mu$) of the $\mu_j$'s, and in which $\tilde{n}_j$, the harmonic mean of the $n_j$'s,
is estimated (when $a$ is large) by

$$\tilde{n}_j = \frac{a}{\sum_{j=1}^{a} (1/n_j)}. \tag{51}$$


When both $ms_A$ and $ms_{GwA}$ have their expected values, we may write

$$\frac{E(ms_A)}{E(ms_{GwA})} = \frac{\sigma^2 + \tilde{n}_{ij}\sigma^2_G + \tilde{n}_j\,\dfrac{\sum_{j=1}^{a}(\mu_j - \mu)^2}{a - 1}}{\sigma^2 + \tilde{n}_{ij}\sigma^2_G}, \tag{52}$$

which makes it apparent that $F = ms_A/ms_{GwA}$ tests the hypothesis that
$\sum_{j=1}^{a}(\mu_j - \mu)^2 = 0$. It is apparent from (52) also that the “treatment effect”
($ms_A$) is due in part to sampling fluctuations resulting from the random
selection of subjects from individual subpopulations ($\sigma^2$), in part to sampling
fluctuations resulting from the random selection of subpopulations or groups
($\sigma^2_G$), and possibly also in part to real differences among the treatments
$\left[\sum_{j=1}^{a}(\mu_j - \mu)^2\right]$.

Meaning of $F = ms_{GwA}/ms_{wG}$


By regarding each group as a random sample from a corresponding (hy- 
pothetical) subpopulation, we may also attach a useful probability meaning 
to $ms_{GwA}/ms_{wG}$.

We have already noted that, by disregarding treatments and considering 
only groups, the total sum of squares for within-treatments in a groups- 
within-treatments design may be analyzed into two components, those for 
groups-within-treatments and within-groups [see (39), page 175].

It may be readily shown in the same manner as with the simple-randomized 
design (see page 53), that the ratio between the mean squares for groups- 
within-treatments and within-groups is distributed as F on the conditions 
stated below. The last of these conditions constitutes the hypothesis to be 
tested, and the others represent the assumptions underlying the test.


1) The distribution of the criterion measures for each subpopulation is 
normal. 


2) The variance of this distribution is the same for all subpopulations. 


3) The “group” taken from each subpopulation is a random sample from 
that subpopulation. 


4) Within each treatment set, the subpopulation means are the same. 


The important considerations in the interpretation of this F are the same as 
those discussed on pages 73-78. In most situations, it may be safely assumed 
that Conditions 1 to 3 are satisfied within limits close enough to leave the 
F-distribution essentially undisturbed. Accordingly, a significant F usually 
means that Condition 4 is false, or that there are real differences among the 
subpopulations in one or more of the populations (treatments). 

We may observe from (45) and (46) on pages 183-184, that if Conditions
1 to 3 are satisfied and both $ms_{GwA}$ and $ms_{wG}$ have their expected values,

$$F = \frac{ms_{GwA}}{ms_{wG}} = \frac{\sigma^2 + n\sigma_G^2}{\sigma^2}. \tag{53}$$

Hence, this ratio is a test of the hypothesis that $\sigma_G^2 = 0$, which is of course the
same as the hypothesis that within each treatment the true subpopulation
means are identical.
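A short rearrangement of (53) may make the dependence on $\sigma_G^2$ more visible. The following merely rewrites the ratio of expected values for a constant group size n; the numerical values in the comment are hypothetical and chosen only for illustration.

```latex
\[
\frac{E(ms_{GwA})}{E(ms_{wG})}
  = \frac{\sigma^2 + n\sigma_G^2}{\sigma^2}
  = 1 + n\,\frac{\sigma_G^2}{\sigma^2}.
\]
% Hypothetical illustration: n = 30 subjects per group and
% \sigma_G^2 = 0.10\,\sigma^2 give a ratio of 1 + 30(0.10) = 4.
```

The ratio exceeds 1 only when $\sigma_G^2 > 0$, and for a given $\sigma_G^2/\sigma^2$ it grows with the group size n.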

In general, with groups-within-treatments designs, there should be little 
interest in this hypothesis, since the choice of this design is presumably based 
on the knowledge that there are real differences among subpopulations (either 
due to the treatments or to extraneous factors whose effects are systematic for
all subjects in the same group). If differences between groups or subpopulations
could not be assumed, the total sample for each treatment might as well
be regarded as a simple random sample, and the method of analysis for simple- 
randomized designs employed. However, in some situations, the groups- 
within-treatments design might be used initially because differences between 
groups are suspected but not known to exist. The test $F = ms_{GwA}/ms_{wG}$ might
then be made. If this F proved nonsignificant, we might then, on the assump- 
tion of no group differences, regard the design as a simple-randomized design. 
However, the assumption of no group differences should be supported by a 
priori considerations as well as shown tenable by the test based on the ratio 
$ms_{GwA}/ms_{wG}$. In this case, the sums of squares for groups-within-treatments
and for within-groups could be added together to give the sum of squares for 
within-treatments, and the sum of these sums of squares could be divided by 
the sum of their degrees of freedom to give the mean square for within-treat- 
ments. This mean square would then be used as the error term in testing the 
significance of the treatment differences. 
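The pooling just described may be written out as follows. This is only a restatement of the paragraph above; the symbol $N$ for the total number of subjects and the subscript $wA$ for "within-treatments" are labels assumed here for convenience.

```latex
\[
ms_{wA}
  = \frac{ss_{GwA} + ss_{wG}}{df_{GwA} + df_{wG}}
  = \frac{ss_{GwA} + ss_{wG}}{(k - a) + (N - k)}
  = \frac{ss_{wA}}{N - a},
\qquad
F = \frac{ms_A}{ms_{wA}}
\quad\text{with } a - 1 \text{ and } N - a \text{ degrees of freedom.}
\]
```

The gain from pooling is the larger number of degrees of freedom for error ($N - a$ rather than $k - a$); it is justified only if the assumption of no group differences is tenable.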

It may be well to emphasize at this point that if, in the groups-within-treat- 
ments design, we regard the entire experimental sample as consisting of a 
number of randomly selected intact finite groups, rather than as a number of 
random samples drawn from randomly selected subpopulations, the test of 
significance given by (42) is valid even though the conditions stated on page 
185 do not apply. That is, these conditions need not be regarded as assump- 
tions underlying the test of significance of the treatments effect. 


Precision of Individual Means and of Differences in Pairs of Means 


The error variance ($\sigma^2_{\bar{M}_j}$) of a treatment mean is given by


For a constant $n_{ij} = n$, $ms_{GwA}$ (see page 183) is an unbiased estimate of
$\sigma^2 + n\sigma_G^2$. Hence,

$$\text{est'd } \sigma^2_{\bar{M}_j} = \frac{ms_{GwA}}{n k_j} = \frac{ms^*_{GwA}}{k_j}, \qquad df = k - a.$$


If $n_{ij}$ is variable, but all group means are to be given equal weight,
$\sigma^2_{\bar{M}_j}$ may be estimated by

$$\text{est'd } \sigma^2_{\bar{M}_j} = \frac{ms^*_{GwA}}{k_j}, \qquad df = k - a,$$

$ms^*_{GwA}$ computed as shown on page 177.

Given either of these estimates of $\sigma^2_{\bar{M}_j}$, it is possible, by procedures already
familiar to the student, to establish confidence intervals for individual treatment
means or to test the significance of differences for individual pairs of
treatment means.

If $n_{ij}$ is variable and the treatment means are weighted means of the group
means, other procedures¹ must be followed for these purposes.
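For the equal-weight case, the computation of the error variance of a treatment mean may be sketched as below. This is a minimal sketch under stated assumptions: it takes $ms^*_{GwA}$ to be the groups-within-treatments mean square computed from the group means themselves (the computation on page 177 is not reproduced here), $k_j$ is taken as the number of groups under Treatment $A_j$, and the function names and group means are hypothetical.

```python
from math import sqrt

def treatment_mean_precision(group_means):
    """group_means[j] lists the group means observed under treatment j."""
    a = len(group_means)                       # number of treatments
    k = sum(len(g) for g in group_means)       # total number of groups
    ss_star = 0.0
    for means in group_means:
        m_j = sum(means) / len(means)          # unweighted treatment mean
        ss_star += sum((m - m_j) ** 2 for m in means)
    ms_star = ss_star / (k - a)                # mean square based on group means
    # Estimated error variance of each treatment mean: ms* / k_j.
    var_of_mean = [ms_star / len(means) for means in group_means]
    return ms_star, var_of_mean

def se_difference(var_1, var_2):
    """Standard error of the difference between two independent treatment means."""
    return sqrt(var_1 + var_2)

# Made-up group means: three groups under one treatment, two under the other.
ms_star, variances = treatment_mean_precision([[10, 12, 14], [11, 15]])
se_diff = se_difference(variances[0], variances[1])
```

The resulting standard error carries $k - a$ degrees of freedom (here 3), which is the value to use when entering the t table for a confidence interval or a test of a difference.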


General Advantages and Limitations of the Groups-Within-Treatments Design


Attention has already been drawn to one very marked advantage of the
groups-within-treatments design over the simple-randomized design and the
other designs thus far considered. This advantage is that the error term
($ms_{GwA}$ or $ms^*_{GwA}$) takes into consideration not only the fluctuations resulting
from random sampling of subjects (Type S errors) but also the results of
extraneous factors having a systematic effect on all subjects within the same
group. This is apparent from (45) on page 183, which shows that $ms_{GwA}$
depends in part on differences among subjects ($\sigma^2$) and in part on differences
among groups ($\sigma_G^2$). The differences among groups ($\sigma_G^2$) may be due in part to
extraneous factors, such as differences in teachers employed with the same
method in different schools, that is, to Type G errors, and in part to differences
among subpopulations which are really characteristic of the treatments, that
is, to Type R fluctuations. In the process of taking a random sample of schools,
a random sample is also taken of each of these types of error. Thus, the
“error term” ($ms_{GwA}$) used in the groups-within-treatments design takes into
consideration all three types of error. Furthermore, the use of this design
avoids any possibility of “contamination” of any one treatment by another.
These are extremely important advantages of the groups-within-treatments
design. Unfortunately they are accompanied by the serious disadvantage of
relatively low efficiency, particularly if the differences between groups ($\sigma_G^2$) are
large, or if $ms_{GwA}$ is very much larger than $ms_{wG}$. For this reason, when
“contamination” is not a serious issue, a much better design is that in which
all treatments are simultaneously or successively administered within each
group or subpopulation selected. This design will be considered in the following
chapter.
STUDY EXERCISES²


1. Mohr³ carried out an investigation to compare the effectiveness of three
methods of introducing third-grade pupils to the use of a separate answer
sheet with a multiple-choice test. Method $A_1$ consisted of oral explanatory
directions, followed by a practice lesson in the use of the answer sheet,
followed by a practice test with the answer sheet, followed by a criterion test.
Method $A_2$ was like Method $A_1$, except that the practice test was omitted.
Method $A_3$ involved no special preparation of any kind; under this “method”
the criterion test was self-administered (with printed directions only). The
criterion test was a twelve-minute, 20-item test of reading comprehension
administered with a separate answer sheet. The score on this test constituted
the criterion measure in the experiment.

1 Hansen and Hurwitz, Journal of American Statistical Association, vol. 37 (1942),
pp. 89 ff.

2 See second paragraph on page viii.

3 Richard H. Mohr, A Study of the Effects of Differential Directions for Teaching the
Use of the Separate Answer Sheet at the Third Grade Level, M.A. Thesis, State University
of Iowa, August, 1951.

For various reasons, some of which should be readily apparent, only one 
method was used in each school. Twenty-seven schools in north central Iowa 
were selected for the experiment. None of these schools was in a multiple- 
building system, all had between 30 and 40 pupils in the third grade, and none 
was currently involved in other experimental work. Twenty-one of the se- 
lected schools agreed to participate in the experiment. These schools were 
randomly assigned, seven to each of the three methods. Detailed instructions 
were sent to each participating teacher. All experiments were carried out (on 
one day) during the sixth week of the fall term. The criterion measure for 
each school was the mean score on the twenty-item reading comprehension 
test (answer sheet test). 


                              Methods
                (School means on answer sheet test)

        A1                      A2                       A3
1. Manly       13.2     1. Quimby       13.5     1. Gowrie        12.4
2. Postville   11.1     2. Northwood    10.7     2. Williamsburg  12.8
3. New Sharon  12.0     3. Roland       12.4     3. Lake Mills    11.6
4. Rockford    11.5     4. New Hampton  15.2     4. Sheffield     14.3
5. Holstein    11.1     5. Keota        11.5     5. Belmond       11.8
6. Lakota      10.5     6. Forest City  11.7     6. Osage         13.1
7. Humboldt    14.3                              7. St. Ansgar    13.5

M_j            11.95                    12.5                      12.79
ΣM²          1011.65                  950.68                    1149.75


a) Complete the analysis and prepare a summary table. 


b) Describe the parent population in detail. Describe the methods popula- 
tions. 


c) May the hypothesis of equal methods population means be rejected at 
the 5% level? May one conclude that the methods are equally effective? 
Explain. Comment on the danger of a serious Type II error in this situa- 
tion. 

d) Do the restrictions on the selection of schools seriously limit the
generality of the results of this experiment? Explain. How does the fact that
one school failed to return the results affect the interpretation?


e) The number of pupils actually tested in individual schools varied from 19
to 39. Why is the analysis nevertheless based on unweighted school
means? Justify this procedure in terms specific to this experiment.


f) Suppose several of the selected schools had (without the knowledge of the
experimenter) already used separate answer sheet tests with their third
graders. Would the F-test of (b) remain valid? Explain. How is the
validity of this F-test affected by the fact that the pupils in some schools
are on the average much better readers than those in other schools?


g) The F-test of (b) involves the assumption that what measures are
normally distributed? Is there any serious danger that the distribution is
sufficiently non-normal to invalidate the F-test? Explain.


h) Must the pupils in each school be regarded as a random sample from
their “school population” in order for the test based on $F = ms_A/ms_{GwA}$
to be valid? Explain.


i) What types of error are considered in the F-test of (b)? Why? Cite
specific illustrations of each type.


j) What specific form does the assumption of homogeneity of variance take
in this situation? Is it conceivable that these methods will create
sufficient heterogeneity of variance to invalidate the F-test of (b)? Does an
inspection of the data suggest an extreme degree of heterogeneity?


k) How would you test the hypothesis that there are no systematic
differences among schools in the ability measured by this reading test? What
new assumptions underlie this test of significance? If this hypothesis
proved tenable, what test of the methods effect could one use as an
alternative to that of (b)? Does this alternative test have any important
advantage over that of (b)? Explain.


The Random Replications Design 


The Generalized Case of the Random Replications Design 


When a population consists either of a number of finite groups or of infinite 
subpopulations, only a few of which may be represented in any single experi- 
ment, it is sometimes possible to duplicate the experiment for each of the 
selected groups or subpopulations independently. The design employed in 
each replication may be the simple-randomized design, or the treatments X 
subjects design, or the treatments X levels design, or any of a number of other 
designs to be considered later. In any case, disregarding any classifications 
(such as levels) other than treatments and replications, the criterion measures 
may be tabulated in a double-entry table, the columns corresponding to the 
treatments and rows to the replications, that is, to the selected groups or sub- 
populations. 


The Random Replications (A × R) Design When the Population Consists of Finite Groups


We shall consider first the case in which the population is regarded as
consisting of a number of finite groups, and in which the sampling is by intact
groups. Each of the r groups selected for the experiment is divided into a
number of subgroups, one for each treatment. The analysis is based on the
subgroup means, all being given the same weight. There is, therefore, only
one entry in each cell of the table. Thus the total sum of squares ($ss^*_T$) may be
analyzed into its treatments ($ss^*_A$), replications ($ss^*_R$), and treatments X
replications ($ss^*_{AR}$) components. The asterisks will distinguish sums of squares based
on an analysis of cell means from those (without the asterisk) based on individual
measures. The computational procedures in the random replications
design are in this case exactly the same as in the treatments X subjects design
(page 157), replications (R) taking the place of subjects (S).

The analysis may be presented in table form as follows:

Source                               df                 ss            ms
Treatments (A)                       a − 1              $ss^*_A$      $ss^*_A/(a - 1)$
Replications (R)                     r − 1              $ss^*_R$      $ss^*_R/(r - 1)$
Treatments X Replications (AR)       (a − 1)(r − 1)     $ss^*_{AR}$   $ss^*_{AR}/[(a - 1)(r - 1)]$
Total                                ar − 1             $ss^*_T$
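The analysis in the table can be sketched in Python for a table of subgroup means with one entry per cell. The sketch is illustrative only: the function name and the table of means are hypothetical, rows are taken to be replications and columns treatments, and the F returned is the ratio of the treatments mean square to the treatments X replications mean square.

```python
def treatments_by_replications(table):
    """table[i][j] is the subgroup mean for replication i under treatment j."""
    r = len(table)         # number of replications (rows)
    a = len(table[0])      # number of treatments (columns)
    grand = sum(sum(row) for row in table) / (a * r)
    col_means = [sum(row[j] for row in table) / r for j in range(a)]
    row_means = [sum(row) / a for row in table]

    ss_T = sum((x - grand) ** 2 for row in table for x in row)
    ss_A = r * sum((m - grand) ** 2 for m in col_means)
    ss_R = a * sum((m - grand) ** 2 for m in row_means)
    ss_AR = ss_T - ss_A - ss_R          # interaction obtained as a residual

    ms_A = ss_A / (a - 1)
    ms_AR = ss_AR / ((a - 1) * (r - 1))
    return {"ss_A": ss_A, "ss_R": ss_R, "ss_AR": ss_AR,
            "ms_A": ms_A, "ms_AR": ms_AR, "F": ms_A / ms_AR}

# Made-up subgroup means: three replications (rows), three treatments (columns).
result = treatments_by_replications([[10, 12, 14],
                                     [11, 13, 18],
                                     [9, 14, 16]])
```

For these made-up means the components are 54 (treatments), 6 (replications), and 6 (interaction), so that F = 27/1.5 = 18 with 2 and 4 degrees of freedom.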


Test of the Significance of the Treatments Effect: With this design, the treatments
effect is tested by $F = ms^*_A/ms^*_{AR}$. The conditions¹ under which
$ms^*_A/ms^*_{AR}$ is distributed as F are essentially the same as those listed on pages
157-158 for the treatments X subjects design (groups, or replications, taking 
the place of subjects). These conditions, phrased in terms appropriate to this 
design, are listed below. As in previous instances, the last of the conditions 
represents the hypothesis to be tested; the other conditions constitute the 
assumptions underlying the test. 


1) The replications (groups) represented in the experiment are a simple 
random sample from a real or hypothetical population of such replica- 
tions. 


2) In each replication, the subgroups (together with all associated extrane- 
ous factors) are randomly assigned to the treatments. (After adminis- 
tration of the treatments, the subgroups may be regarded as randomly 
selected from hypothetical treatment populations.) 


3) The treatments X replications (AR) interaction effects are normally and 
independently distributed in each treatment population (except when 
a = 2, in which case the t-test of page 19, assuming normal distribution 
of differences, applies).


4) The distribution of interaction effects has the same variance for each 
treatment population (not necessary when a = 2). 


5) The (unweighted) mean of the subgroup means is the same for all treat- 
ment populations. 


The proof that under these conditions $ms^*_A/ms^*_{AR}$ is distributed as F is
exactly similar to the proof presented on page 158 for the treatments X sub- 
jects design. The student should review this proof carefully and translate 
it into the terms of this particular design. 

In any particular experiment, if Conditions 1 to 4 are met, this F-ratio may
be used to test the hypothesis expressed in Condition 5. In any application,
careful consideration should be given to each of the four conditions essential to
the validity of the F-test.

1 See footnote on page 51.

The important considerations so far as Condition 1 is concerned are essen- 
tially the same as those presented on pages 179-181 in the discussion of the 
groups-within-treatments design. The reasons for regarding the treatment 
subgroup as the unit of analysis, or for giving all subgroups the same weight 
regardless of size, are the same as those given on pages 173-174. Again, the stu- 
dent should review these discussions carefully with this particular design in mind. 

Since Condition 2 is subject to the control of the experimenter, there is gen- 
erally no reason why it should not be completely satisfied in any particular 
application. It should be noted that the individual subjects need not necessar- 
ily be randomly assigned to the treatment subgroups. There are a number of 
different possibilities for constituting these subgroups. One is to assign the 
subjects at random to the subgroups, in which case each replication represents 
an experiment of the simple-randomized type. Another possibility is to design 
each replication as a treatments X levels experiment, that is, to “match” the
subjects in the various subgroups on the basis of some control variable. An- 
other possibility is to administer all treatments in succession to the entire 
group, so that each replication is an experiment of the treatments X subjects 
type, and the various subgroups within a replication are different sets of ob- 
servations on the same subjects. Sometimes the groups are already organized 
into subgroups, and the experimenter must use the subgroups as he finds them. 
For example, in an educational experiment with methods of instruction, each 
group may represent the pupils in a given grade in a given school, and these 
pupils may already be organized into classes for instructional purposes. The 
experimenter may not be permitted to reorganize these classes for the purposes 
of his experiment, but may be required to use them as they are, with the teach- 
ers, classrooms, hours, etc., already assigned to them. Even so, he can still 
apply a valid test of significance to the differences among the general treat- 
ment means, if he can randomize the classes in each school with reference to the 
experimental treatments. Still other ways of constituting the treatment sub- 
groups in each replication will be considered in later chapters. 

While there are many possible ways of constituting the treatment subgroups 
in each replication, it is highly desirable that whatever method is employed be 
the same for all replications. If the subjects are assigned at random to the 
subgroups in one replication, they should be similarly assigned in all. If the 
subjects are matched on a certain basis in one replication, they should be 
matched on the same basis in all replications. In any case, of course, whatever 
the manner in which the subjects are assigned to the subgroups, the subgroups 
(together with all extraneous factors associated with them) should be independ- 
ently randomized with reference to the treatments. For example, in an in- 
structional methods experiment conducted in a number of schools, the pupils 
may be assigned to the experimental classes on a random basis. Each class 
may then be assigned a teacher, a classroom, a recitation period, and all other 
administrative arrangements may be completed that will affect the subgroups
during the course of the experiment. Then, as a final step, the treatments may
be assigned at random to the classes. In this fashion, all extraneous factors, 
such as teachers, classrooms, etc., will simultaneously be randomized with 
reference to the treatments. This procedure is essential if all extraneous fac- 
tors are to be taken adequately into consideration in the test of significance. 
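The final randomization step may be sketched as follows. The sketch is a hypothetical illustration, not a prescribed procedure: the function name, school names, and class labels are invented, and it assumes that each school has exactly one already-constituted class per treatment. The essential point is that the shuffle is carried out independently within each school (replication).

```python
import random

def assign_treatments(classes_by_school, treatments, seed=None):
    """Randomly assign one treatment to each class, independently per school."""
    rng = random.Random(seed)
    plan = {}
    for school, classes in classes_by_school.items():
        if len(classes) != len(treatments):
            raise ValueError("this sketch assumes one class per treatment in each school")
        order = list(treatments)
        rng.shuffle(order)                     # a fresh shuffle for every school
        plan[school] = dict(zip(classes, order))
    return plan

# Hypothetical schools and classes; teachers, rooms, and hours are already fixed.
schools = {
    "School 1": ["Class a", "Class b", "Class c"],
    "School 2": ["Class a", "Class b", "Class c"],
}
plan = assign_treatments(schools, ["A1", "A2", "A3"], seed=1)
```

Because each class carries its teacher, classroom, and period with it, assigning the treatments to the intact classes in this way randomizes all of those extraneous factors at once.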

When the analysis is based upon the unweighted means of the treatment 
subgroups, it is not essential, even though it may be desirable, that all treat- 
ment subgroups in any replication be the same size. In this situation, if any 
subjects are “lost” during the course of the experiment, no special problem is 
raised so far as the analysis of results is concerned, unless there is reason to 
believe that these losses result in a systematic bias with reference to treat- 
ments, which would very rarely be the case. 

So far as the requirement of normal distribution of interaction effects is 
concerned, the considerations are much the same as those discussed on page 
181. If the subgroups are the same or nearly the same size from replication
to replication, if a uniform control is maintained over Type S and Type
G errors in all replications, and if these errors are completely randomized with 
reference to treatments in each replication, it seems reasonable to assume that 
in most situations the interaction effects will be normally distributed for each 
treatment. As has been previously noted, these interaction effects are in part 
intrinsic (due to treatments only) and in part extrinsic (due to error). That 
part of the interaction which is extrinsic may be attributed either to Type S or
to Type G errors, or to both. It is possible that extraneous factors operating 
systematically on treatment subgroups (Type G errors) have been effectively 
equalized, so that the interaction effects are due primarily to Type S errors 
resulting from the random assignment of subjects to subgroups. In that case 
if the subgroups differ markedly in size from replication to replication, there 
may be some tendency toward peakedness in the distribution of the interaction 
effects for each treatment. This will happen because the corrected means of 
the large subgroups will tend to have a normal distribution with a small vari- 
ance, while the corrected means of the small subgroups will tend also to have a 
normal distribution, but one with a large variance. When these two distribu- 
tions are thrown together, the combined distribution will tend to be more 
peaked than a normal distribution. 
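The peakedness of such a combined distribution can be checked with a small calculation. The sketch below is an illustration under simplifying assumptions introduced here: the two component distributions are normal with mean zero, they are mixed in proportions p and 1 − p, and peakedness is measured by the kurtosis (fourth moment divided by the squared variance), which equals 3 for a normal distribution. The function name is hypothetical.

```python
def mixture_kurtosis(var_1, var_2, p=0.5):
    """Kurtosis of a mixture of N(0, var_1) and N(0, var_2) in proportions p, 1 - p."""
    second_moment = p * var_1 + (1 - p) * var_2
    # The fourth moment of a normal distribution with variance v is 3 * v**2.
    fourth_moment = 3 * (p * var_1 ** 2 + (1 - p) * var_2 ** 2)
    return fourth_moment / second_moment ** 2

equal = mixture_kurtosis(4, 4)      # same variance in both components
unequal = mixture_kurtosis(1, 9)    # small-variance and large-variance components
```

With equal variances the kurtosis is 3, as for a single normal distribution; with variances of 1 and 9 in equal proportions it is 4.92, so the combined distribution is more peaked than the normal.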

With subgroups of varying sizes, departure from normality is particularly 
likely if there is any correlation between the subgroup means and the sizes of 
the subgroups (such as is sometimes found between means of achievement test 
scores and sizes of schools — the larger schools usually making the higher 
average scores). 

Failure to exercise uniform control over Type G errors in all replications will 
tend to have much the same effect as differences in size of subgroups. If for 
some replications these errors are normally distributed with a large variance, 
while for others they are normally distributed with a small variance, the combined
distribution will again tend to be more peaked than a normal distribution.
It is highly desirable, therefore, that both the size of the subgroups and
the control over extraneous factors be as uniform as possible from replication
to replication.

What form of distribution the intrinsic interaction effects (Type R errors) 
will take is more difficult to say. However, in many experiments, the extrinsic 
interaction will be very much larger than the intrinsic interaction, and any 
lack of normality in the distribution of the intrinsic interaction effects will tend 
to be “covered over” by the normality of the predominant extrinsic interac- 
tion effects. 

As has been previously noted (pages 78-86), the assumption of normality is, 
fortunately, in general not a very critical requirement for an F-distribution of
the $ms^*_A/ms^*_{AR}$ ratio. Granting reasonably uniform error control, plus randomization
of errors in all replications, it would seem that Condition 3 need not
cause the experimenter much concern in most applications of this design. 

The requirement of homogeneous interaction between treatments and repli- 
cations will be particularly difficult to evaluate in many specific applications. 
However, there is good reason to believe that, in general, the interaction will 
not be sufficiently heterogeneous to disturb seriously the validity of the F-test 
of the treatment effects. We may note first that the interaction is always in 
part an extrinsic interaction, but that in some instances it may also be due in 
part to intrinsic interaction between treatments and groups (replications). It 
is quite possible that in some situations the intrinsic interaction may be 
markedly heterogeneous. For instance, in a treatments X schools experiment 
one of the treatments may depend to a marked degree on factors which differ 
markedly from school to school — such as environmental factors in the school 
and community, or the nature and adequacy of school facilities (e.g., labora- 
tory equipment and library), or curriculum differences which may predispose 
the pupils in some schools to more effective use of certain treatments than the 
pupils in other schools. The effectiveness of other treatments may be quite 
independent of such factors. Thus, some treatments may be essentially additive
in their effects on the school means while others may not, and the intrinsic
interaction may be quite heterogeneous. 

That part of the interaction due to error, however, should rarely be hetero- 
geneous, granting only that the errors have been randomized with reference to 
treatments in each replication independently. The extent to which the total 
observed interaction is heterogeneous, therefore, usually depends on the extent 
to which the intrinsic or the extrinsic interaction predominates over the other. 
If the extrinsic interaction, which is usually homogeneous, is much larger than 
the intrinsic interaction, which may be markedly heterogeneous, the total 
interaction will show only a moderate degree of heterogeneity — possibly not 
of sufficient degree to disturb seriously the validity of the F-test. If the intrinsic
interaction predominates and is markedly heterogeneous, then the total
interaction will be markedly heterogeneous also, and the F-test of the treat- 
ment effects may be vitiated. 

If the interaction is known to be markedly heterogeneous, a useful interpretation
of the F-ratio may still be made. The effect of marked heterogeneity of
interaction on the F-test is to result in a larger number of “significant” F’s
than would otherwise be obtained. According to Cochran and Cox, an allowance
can be made for the most extreme of such effects by regarding the
$ms^*_A/ms^*_{AR}$ ratio as having 1 and r − 1 degrees of freedom, rather than a − 1
and (a − 1)(r − 1) degrees of freedom.¹ If the obtained F is “significant” at
the desired level when regarded as having these reduced degrees of freedom,
one may quite confidently reject the null hypothesis of equal treatment means,
even though the interaction is heterogeneous. If the obtained F lies between
the significant values for (a − 1)/(a − 1)(r − 1) and 1/(r − 1) degrees of freedom,
the result is much more difficult to interpret. In this case, if marked
heterogeneity is suspected, perhaps the best procedure is to resort to separate
tests of treatment effects for individual pairs of treatments. In an experiment
involving only two treatments, no assumption of homogeneity of interaction is
necessary. Accordingly, if $F = ms^*_A/ms^*_{AR}$ is computed independently for each
of the important comparisons of two treatment means, the assumption of
homogeneous interaction may be obviated.
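The decision rule just described may be sketched as follows. The sketch is illustrative: the function names and the wording of the three outcomes are introduced here, and the critical values are not computed but are assumed to be read by the user from an F table at the chosen level of significance.

```python
def degrees_of_freedom(a, r):
    """Usual and reduced (conservative) degrees of freedom for the treatments F."""
    usual = (a - 1, (a - 1) * (r - 1))
    reduced = (1, r - 1)
    return usual, reduced

def interpret(F, crit_usual, crit_reduced):
    """crit_usual and crit_reduced are table values for the two sets of df."""
    if F >= crit_reduced:
        return "reject: significant even with the reduced degrees of freedom"
    if F >= crit_usual:
        return "doubtful: consider separate tests for pairs of treatments"
    return "retain: not significant even with the usual degrees of freedom"

# Hypothetical experiment with a = 4 treatments and r = 10 replications.
usual, reduced = degrees_of_freedom(4, 10)
# 5% table values of roughly 2.96 for (3, 27) df and 5.12 for (1, 9) df.
verdict = interpret(4.0, crit_usual=2.96, crit_reduced=5.12)
```

In this hypothetical case an obtained F of 4.0 falls between the two critical values, which is the situation in which separate tests for individual pairs of treatments are suggested.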


Replications of the Simple-Randomized Design with Subgroups of the Same Size (Random Sampling from Randomly Selected Subpopulations)


In some applications of the random replications design, it is possible to 
regard the subjects in each replication as a simple random sample from a 
corresponding subpopulation (real or hypothetical), and also to regard the 
subpopulations represented in the experiment as having been randomly se- 
lected from a population consisting of a very large number of such subpopula- 
tions. For instance, in an experiment with methods of school instruction, it is 
possible to regard the pupils in each school as a random sample from a sub- 
population corresponding to that school, and to regard the schools in the ex- 
periment as a random sample from a population of schools. 

In such cases, if the simple-randomized design is used in each replication,
and if the subgroups are the same size for all replications, it may be desirable
to base the analysis on the criterion measures for the individual subjects rather
than on the subgroup means. The total sum of squares ($ss_T$) may then be
analyzed (as in any double-entry table with proportional cell frequencies) into
four components: in this case the treatments ($ss_A$), replications ($ss_R$),
treatments X replications ($ss_{AR}$), and within-subgroups ($ss_w$) components. Each of
the first three components is n times as large as the corresponding component
when the subgroup mean is the unit of analysis ($ss_R = n\,ss^*_R$, $ss_A = n\,ss^*_A$,
etc.).
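The relation between the two analyses can be verified numerically for the treatments component. The sketch below is only a check on made-up scores: the function name and data layout are assumptions introduced here, and every subgroup is assumed to contain the same number n of subjects.

```python
from statistics import mean

def treatment_ss_both_ways(cells):
    """cells[i][j] lists the n scores for replication i under treatment j."""
    r, a = len(cells), len(cells[0])
    n = len(cells[0][0])

    # Analysis of individual measures: each subject contributes
    # (treatment mean - general mean) squared.
    grand = mean(x for row in cells for cell in row for x in cell)
    trt = [mean(x for row in cells for x in row[j]) for j in range(a)]
    ss_A = sum((trt[j] - grand) ** 2
               for row in cells for j in range(a) for _ in row[j])

    # Analysis of subgroup means: one entry per cell.
    cell_means = [[mean(cell) for cell in row] for row in cells]
    grand_star = mean(m for row in cell_means for m in row)
    trt_star = [mean(row[j] for row in cell_means) for j in range(a)]
    ss_A_star = r * sum((t - grand_star) ** 2 for t in trt_star)
    return n, ss_A, ss_A_star

# Made-up scores: two replications, two treatments, two subjects per subgroup.
n, ss_A, ss_A_star = treatment_ss_both_ways([[[1, 3], [5, 7]],
                                             [[2, 4], [8, 10]]])
```

Here the sum of squares for treatments is 50 when based on individual measures and 25 when based on subgroup means, that is, n = 2 times as large, as stated above.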
The test of significance of the treatment effects is again based on
$F = ms_A/ms_{AR}$ (which in this case equals $ms^*_A/ms^*_{AR}$) and the considerations
underlying the interpretation of this F are exactly the same (with one exception
to be considered later) as those discussed on pages 191-195.

1 See W. G. Cochran and G. M. Cox, Experimental Designs (New York: John Wiley
and Sons, 1950), pp. 396-401.

The principal advantage of using the subject as the unit of analysis is that 
it makes possible a test of the significance of the interaction, $F = ms_{AR}/ms_w$.
The conditions under which this mean square ratio is distributed as F have 
been previously presented (pages 138-141). These should be carefully re- 
viewed by the student with specific reference to this design. 

Should the interaction prove nonsignificant, and should other considerations 
permit, we might assume that there is no interaction (either intrinsic or due to 
Type G errors). On this assumption, another test of the treatment effects is 
available. Let us suppose that in the double-entry table corrections have been 
applied to the measures within each replication (row) so as to make the mean 
for each replication equal to the general mean. In this corrected table, the 
sum of squares for between-cells-within-columns is the same as the sum of 
squares for interaction in the original table. Accordingly, if there is no interaction,
all differences among cell means within any column of the corrected table
are due to random Type S errors only. This is equivalent to saying that the
measures in any column may be regarded as a simple random sample, or that
the once-corrected table may be regarded as representing a simple-randomized
design. In this case, $ms'_A/ms'_{wA}$ for the once-corrected data is distributed as F.
But

$$ms'_A = ms_A$$

and

$$ms'_{wA} = (ss_{AR} + ss_w)/(df_{AR} + df_w).$$

Hence,

$$F = ms_A \Big/ \frac{ss_{AR} + ss_w}{df_{AR} + df_w} \tag{54}$$

may, on the assumption of no interaction, be used to test the treatment effects.

On the assumption of no interaction, $ms_{AR}$ and $ms_w$ are both unbiased
estimates of the common within-cells variance ($\sigma^2$). Hence, on this assumption,
either $F = ms_A/ms_{AR}$ or $F = ms_A/ms_w$ may be used to test the treatment effects,
but (54) is preferable because of the larger number of degrees of freedom
available for this test.
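The test in (54) may be sketched as below. The function name and the numerical values are hypothetical and serve only to show how the interaction and within-subgroups components are pooled into a single error term.

```python
def pooled_error_F(ss_A, df_A, ss_AR, df_AR, ss_w, df_w):
    """F of (54): treatments mean square over the pooled AR + within error term."""
    ms_A = ss_A / df_A
    ms_pooled = (ss_AR + ss_w) / (df_AR + df_w)
    return ms_A / ms_pooled, df_A, df_AR + df_w

# Made-up components: a = 3 treatments, r = 3 replications, n = 5 per subgroup.
F, df_num, df_den = pooled_error_F(ss_A=54, df_A=2,
                                   ss_AR=6, df_AR=4,
                                   ss_w=90, df_w=36)
```

With these made-up values the pooled error mean square is 96/40 = 2.4 and F = 11.25 with 2 and 40 degrees of freedom, as against only 4 degrees of freedom for error if the interaction mean square alone were used.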

The assumptions underlying (54) do not require that the treatment sub- 
groups be the same size in all replications, although they do require that 
corresponding treatment subgroups be proportional from replication to repli- 
cation. 

It is important to note that if the subgroups differ in size from replication to
replication (even though they are proportional), the ratio $ms_A/ms_{AR}$ is not
distributed as F. When the subgroups vary in size, therefore, the test of the treatment
effects should, if other considerations permit, be based on the unweighted
subgroup means. That is, $F = ms'_A/ms'_{AR}$ should be used rather than
$F = ms_A/ms_{AR}$. A number of texts have wrongly suggested that, even though
the subgroups vary in size from replication to replication, $ms_A/ms_{AR}$ is distributed
as F under the same conditions that $ms'_A/ms'_{AR}$ is distributed as F. The
student should be on guard against this error.
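
The analysis based on unweighted subgroup means can be sketched in Python as follows. This is only a minimal illustration: the function name is an assumption, the table of subgroup means is hypothetical, and each cell mean is treated as a single observation in a two-way table (replications by treatments), so that the interaction mean square serves as the error term.

```python
def unweighted_means_anova(means):
    """Analyze a table of subgroup means, means[i][j], where i indexes
    replications (rows) and j indexes treatments (columns).

    Returns the treatment and interaction mean squares computed from the
    cell means alone, and their ratio F with its degrees of freedom.
    """
    r = len(means)        # number of replications
    a = len(means[0])     # number of treatments
    grand_total = sum(sum(row) for row in means)
    correction = grand_total ** 2 / (r * a)

    ss_total = sum(m * m for row in means for m in row) - correction
    col_totals = [sum(row[j] for row in means) for j in range(a)]
    row_totals = [sum(row) for row in means]
    ss_a = sum(t * t for t in col_totals) / r - correction
    ss_r = sum(t * t for t in row_totals) / a - correction
    ss_ar = ss_total - ss_a - ss_r   # interaction, by subtraction

    df_a, df_ar = a - 1, (a - 1) * (r - 1)
    ms_a, ms_ar = ss_a / df_a, ss_ar / df_ar
    return {"ms_a": ms_a, "ms_ar": ms_ar, "F": ms_a / ms_ar,
            "df": (df_a, df_ar)}


# Hypothetical subgroup means: 4 replications (rows) x 3 treatments (columns).
cell_means = [
    [10.0, 12.0, 14.0],
    [11.0, 15.0, 13.0],
    [9.0, 12.0, 15.0],
    [12.0, 13.0, 16.0],
]
result = unweighted_means_anova(cell_means)
print(result)
```

Because only the cell means enter the computation, differences in subgroup size from replication to replication do not affect the form of the test.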


The Special Case of “Simple” Replications 


“Simple” replications may be regarded merely as a special case of random 
replications in general. Replications in an A X R design may be called “simple”
replications if, instead of having been drawn from different subpopulations,
the subjects in the various replications are all drawn at random from the
same population. An example of a design involving simple replications was 
given on pages 18-20. 

In the case of simple replications, there can, of course, be no intrinsic AR
interaction. The significance of the observed AR interaction can be due only
to Type G errors which vary from replication to replication. The AR interaction
is then a valid error term for testing the treatment effects only if the Type
G errors have been independently randomized for each replication (see Condition 2,
page 191).

The object of providing for simple replications of a simple-randomized or a 
treatments X levels design is to provide an error term for testing treatment 
effects that will take Type G as well as Type S errors into consideration. Be- 
cause of the relative importance of Type G errors in educational and psycho- 
logical experiments in general, the use of simple replications is an important 
device in educational and psychological research. 


Testing the AR Interaction in Random Replications of Treatments x Levels or Treatments x Subjects Designs


When the simple-randomized design has been used in each replication, the
AR interaction may be tested by $F = ms_{AR}/ms_w$. This test may not be employed,
however, when the treatments X levels or treatments X subjects design
is replicated, since in that case the various treatment subgroups within
each replication are not independent random samples. Under certain conditions
it is still possible to test the significance of the AR interaction in replications
of the treatments X levels design, but the appropriate test can more
conveniently be considered in a later chapter (Chapter 10, pages 238-239).


Testing Differences in Individual Pairs of Treatment Means 

We have seen that the random replications design is, so far as the computational
procedures and the tests of significance are concerned, essentially the
same as the treatments X subjects design. In the treatments X subjects design,
the rows in the double-entry table correspond to randomly selected
subjects; in the random replications design, they correspond to randomly
selected replications; in both cases, the error mean square for testing the treatment
effects is the interaction mean square. Accordingly, see (36), page 165,
the estimated error variance of the difference between two treatment means in
a random replications design is


$$\text{est'd } \sigma^2_{M_j - M_k} = \frac{2\,ms'_{AR}}{r} = \frac{2\,ms_{AR}}{rn}, \qquad df = (a - 1)(r - 1), \qquad (55)$$

and the critical difference is

$$d = t\sqrt{\frac{2\,ms'_{AR}}{r}} = t\sqrt{\frac{2\,ms_{AR}}{rn}},$$

the t representing that for the selected level of significance and the given
degrees of freedom. The differences among the treatment means are then
classed as significant or non-significant, just as in the example on pages 93-94.

These tests assume homogeneous interaction for all pairs of treatments as 
well as normal distributions of interaction effects. If homogeneity of inter- 
action is in serious doubt, the safest procedure is to employ as the error term 
for each pair of treatments the mean square for interaction computed from the 
data for those treatments only, in which case no assumption of homogeneity is 
involved. 
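
The pairwise procedure based on (55) might be sketched in Python as below. The function names, the treatment means, and the interaction mean square are all hypothetical; the value of t (2.447 for 6 degrees of freedom at the 5% level, two-tailed) is simply read from a standard t table and passed in by hand.

```python
import math
from itertools import combinations


def critical_difference(ms_ar_means, r, t_crit):
    """Critical difference between two treatment means, t * sqrt(2 ms'_AR / r),
    where ms'_AR is the interaction mean square computed from subgroup means
    and r is the number of replications."""
    return t_crit * math.sqrt(2.0 * ms_ar_means / r)


def significant_pairs(treatment_means, d):
    """Return index pairs of treatment means differing by more than d."""
    return [(j, k) for j, k in combinations(range(len(treatment_means)), 2)
            if abs(treatment_means[j] - treatment_means[k]) > d]


# Hypothetical values: 3 treatments, r = 4 replications,
# so the interaction has (3 - 1)(4 - 1) = 6 degrees of freedom.
treatment_means = [10.5, 13.0, 14.5]
d = critical_difference(ms_ar_means=5.0 / 3.0, r=4, t_crit=2.447)
pairs = significant_pairs(treatment_means, d)
print(f"critical difference = {d:.3f}; significant pairs = {pairs}")
```

This sketch uses the single over-all interaction mean square for every pair and so assumes homogeneity of interaction; where that is in doubt, the mean square would instead be recomputed from the two treatments being compared.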


Establishing a Confidence Interval for the True Mean for a Given Treatment


In establishing a confidence interval for the population mean for any treat- 
ment, the reasoning is the same as in the case of the treatments X subjects 
design (see pages 166-167). The estimated error variance for a given treatment
mean is given by


* 
MS pa 
est'd ou., = —— 
i r 
or, if n;; = n is constant, by 
3 ms, 4 
est'd ou. = —— 
1 rn 


in each case with (r — 1) degrees of freedom. 


Important Precautions in the Planning and Administration of a Random Replications (A x R) Experiment


In the preceding discussion of the F-test of the treatment effects, we have
noted a number of important implications of the assumptions underlying this
test. In view of the importance of these implications, it may be well, even at
the cost of some repetition, to restate them in the form of the specific precautions
to be taken in planning and administering an experiment of the random
replications type, as follows:


1) Make certain that the replications represented in the experiment are (or
may be regarded as) a simple random sample from a meaningful population
of such replications. In the A X R design, the R stands not only
for “replications,” but for “random” replications as well. In some instances,
the population must be defined to “fit” the sample actually
taken; but unless this population is meaningful and closely related to
some real population, the experiment will be of little value.

2) Ensure that as many as possible of the errors affecting the subgroup
means are completely randomized in each replication. In general, this
means that the random assignment of treatments to subgroups should be
made after completing all administrative arrangements affecting the
subgroups during the course of the experiment.

3) Provide for uniform error control in all replications. Give separate consideration
to the control of Type S and Type G errors. Uniform control
over Type S errors usually implies that the subgroups should be approximately
the same size for all replications. If subgroup n’s vary from replication
to replication, the appropriateness of basing the analysis on unweighted
subgroup means should be carefully considered.

4) Provide for the closest possible control over errors of all types in all
replications. This is not essential to the validity of the F-test, but it is
essential if the treatment comparisons are to be precise, and if the experiment
as a whole is to be efficient. Sometimes close control is impracticable,
as when, in a methods of instruction experiment, the
experimenter must use classes already organized in the schools, must use
teachers of widely differing abilities, etc. In this case, a satisfactorily
valid test is still possible with complete randomization, but each replication
is certain to be low in precision; high precision in the total experiment
can then be secured only through the use of a large number of
replications. In general, the more precise and efficient the design used in
each replication, the more precise and efficient the entire random replications
experiment will be. Many possible designs, in addition to those
already considered, will be suggested in later chapters.

5) Employ a sufficient number of replications to provide a substantial number
of degrees of freedom for the error term ($ms'_{AR}$ or $ms_{AR}$). Otherwise,
the sensitivity of the test and the efficiency of the experiment will be
seriously impaired. Since the number of degrees of freedom for the error
term depends on the number of treatments as well as on the number of
replications, fewer replications will be needed if the number of treatments
is large than if it is small.


6) Make a careful, logical analysis of all factors that might result in a 
heterogeneous intrinsic interaction. Weigh as carefully as possible, also, 
the probable relative importance of intrinsic interaction and extrinsic 
interaction. If the extrinsic interaction appears to predominate strongly, 
the total interaction will probably be sufficiently homogeneous for the 
purposes of the F-test, even though the intrinsic interaction is hetero- 
geneous. If a predominant and markedly heterogeneous intrinsic inter- 
action is suspected, it may be well to break the entire experiment up 
into more homogeneous comparisons (unless the F is still significant when 
regarded as having the reduced number of degrees of freedom suggested 
earlier). 


7) Select the minimum level of significance at which you will reject the null 
hypothesis in terms of your judgment of the degree to which the under- 
lying assumptions have been satisfied in the experiment. 


Advantages and Limitations of the Random Replications Design 


The outstanding advantage of the random replications design over the 
simple-randomized, treatments X levels, and treatments X subjects designs is 
that it takes Type G and Type R errors, as well as Type S errors, into consider- 
ation in the test of significance. This is an extremely important advantage of 
this design, particularly in situations in which, for practical reasons, it is 
impossible to employ close control over various types of errors in individual 
replications, or in which there is a substantial interaction between treatments 
and replications. For this reason, the random replications design is almost 
essential to the conduct of satisfactory experiments with methods of school 
instruction. 

Another very important advantage of the random replications design, 
closely related to that just considered, is that it often permits the use, in each 
individual replication, of a design for which no test of the significance of the 
treatment effect is available so far as that replication alone is concerned. 

One instance of this kind has just been mentioned, that in which the sub- 
groups in an experiment with methods of school instruction consist of classes 
already organized in the school, and in which all the experimenter can do is to 
assign these classes at random to the treatments. In any one such replication 
considered alone, of course, no test of significance of the treatment effects is 
possible, because the treatment groups are not random. This characteristic of 
the random replications design deserves special emphasis, and will be given 
separate consideration later in the following section. 

Due to the control over differences among groups, the random replications 
design is usually much more precise than the groups-within-treatments design.
In the groups-within-treatments design, all group differences are contained in
the error term. In the random replications design, a large part of these differences
(as measured by $ms_R$) is taken out of the error term; only the interaction
effects remain in it. The advantage of the random replications design over the
groups-within-treatments design is therefore essentially the same as that of 
the treatments X subjects design over the simple-randomized design. In ex- 
periments with methods of school instruction, differences among schools are 
sometimes of almost the same magnitude as differences among individual 
pupils in the same school; hence control over school differences adds immensely 
to the precision of the experiment. 

In the situation in which the total interaction is due primarily or entirely to 
Type S errors, the F-test based on $ms'_A/ms'_{AR}$ is less satisfactory than one employing
a “within-cells” or a pooled error term, because of the much smaller
number of degrees of freedom usually available for the interaction mean 
square. 


The Possibilities of Simple Random Replication 


Experimenters have often been far more ingenious in inventing designs to 
control various sources of error than in finding ways of testing the significance 
of the treatment effects in designs invented. This has been particularly true 
with many so-called “counterbalanced” designs in psychological research. 
The principle of random replication offers a general solution to the problem of 
evaluating the results obtained with many such designs. If simple replication
of the design is possible (that is, if the same experiment can be repeated with
independent random samples of subjects, and if the treatments can be completely
randomized with reference to the various sources of error in each replication),
a valid test of the treatment effect can usually be made by means of
$ms'_A/ms'_{AR}$. The crucial requirements are that the interaction be homogeneous
for all pairs of treatments and that the interaction effects be normally distributed.
The possibilities of simple random replication will be more fully
explored in Chapters 10 and 13.


The Use of $ms_{AL}$ as an Error Term in Treatments x Levels Designs


In Chapter 5, page 145, the fact was noted that, under certain conditions in
a treatments X levels design, $ms_{AL}$ may be employed as the error term in testing
the significance of the treatments effect. These conditions are:


1) The observed interaction is due to error only, that is, there is no intrinsic 
interaction. 

2) Type G errors have been randomized with reference to treatments at
each level independently.

3) These Type G errors may be regarded as a random sample from a hypothetical
population of such errors.


If these conditions are satisfied the design is essentially a random replications 
design so far as Type G errors are concerned, and the interaction (AL) mean 
square may be used as the error term in testing the significance of the main 
effect of treatments. 


STUDY EXERCISES ¹


1. The following problem is adapted from a study by Porter² which was
concerned with the relative effectiveness of four methods of studying a given 
reading passage. The study methods differed primarily in the placement of 
questions (before or after the reading selection), in the degree of detail of the 
questions (main or main and subordinate), and in the source of the questions 
(pupil or experimenter). 

Fifteen elementary schools were randomly selected from a list of all elemen- 
tary schools in Iowa communities of over 20,000 population. Letters were sent 
to the principals of these schools requesting the co-operation of their eighth- 
grade classes in carrying out the study. Thirteen principals agreed to partici- 
pate and sent in copies of their eighth grade enrollment lists. From these lists, 
the pupils in each class were randomly assigned to four experimental groups of 
equal size. 

The materials, consisting of written directions, the reading selection, ques- 
tion sheets, and a 60-item multiple choice test over the selection, were organ- 
ized in booklet form. Since different experimental methods were produced by 
varying the arrangement of these elements, it was possible to administer the 
experiment at one sitting to all the pupils in a given school. Each student 
received the materials for the particular method to which he had been ran- 
domly assigned by the experimenter. The teacher then read several brief 
paragraphs of general instructions pertaining to the conduct of the experiment, 
following which she read four paragraphs of specific instructions; one directed 
at each of the four experimental groups. After questions had been cleared up, 
the experimental session proceeded, each method involving the same period of 
time. Immediately upon the conclusion of this period, all pupils took the 60- 
item test over the reading passage. 

The data are as follows. (The measures in the body of this table are subgroup
means, $M_{ij}$.)

¹ See second paragraph on page viii.
² William P. Porter, The Relative Effectiveness of Questions and Their Placement in
Directed Study, Ph.D. Thesis, State University of Iowa, 1942.


                      Study Methods (A)
School (S)      I       II      III     IV      Totals

     1         32.8    33.9    35.1    37.1     138.9
     2         34.6    37.1    38.1    39.4     149.2
     3         34.9    33.2    34.2    34.6     136.9
     4         35.5    33.3    34.0    41.0     143.8
     5         34.3    27.0    32.2    35.4     128.9
     6         38.4    36.5    36.7    37.8     149.4
     7         33.1    28.8    30.0    32.5     124.4
     8         31.3    31.8    32.1    32.4     127.6
     9         36.2    35.6    35.9    34.8     142.5
    10         33.1    35.6    38.7    35.8     143.2
    11         35.0    33.3    39.0    34.9     142.2
    12         33.9    33.3    36.9    38.3     142.4
    13         38.1    34.7    40.3    36.8     149.9

  Totals      451.2   434.1   463.2   470.8    1819.3
  Means        34.7    33.4    35.6    36.2

Calculations Provided:

$\sum T_A^2/13 = 63710.1$
$\sum T_S^2/4 = 63861.9$
$\sum\sum M_{ij}^2 = 64048.1$


a) Complete the analysis of the data and prepare a summary table. 


b) It is obvious that some schools provided more pupils than others — hence 
the subgroup means within each treatment are based on varying n's. We 
have calculated our methods means as the simple average of the subgroup 
means. On what reasoning is this a legitimate procedure? 


c) Apply an over-all test of the significance of the differences among the 
method means, using the 1% level. State carefully the exact hypothesis 
tested [note (b) preceding]. 


d) State explicitly the assumptions underlying the test in (c). Discuss the 
probable extent to which these assumptions are violated, and the proba- 
ble effect on the validity of the usual F-test. 


e) If any very marked heterogeneity of interaction characterized this experiment,
how might you detect its presence? That is, what would you
look for in the table of means? (See Exercise 2, Chapter 6, page 169.)


f) What advantages accrue from administering the tests simultaneously
and on a group basis to all the pupils in one room? How would the results
probably have differed if the different methods groups in each school
had been tested in different rooms and under different teachers?


g) Discuss the probable relative magnitudes of the intrinsic and extrinsic
components of the AS interaction. What may be true of the previous
instructional experiences of the pupils in various schools that may give
rise to an intrinsic interaction? What are several possible specific sources
of extrinsic interaction in this design? Discuss the possibilities of rendering
the latter effects negligible through careful experimental control.


h) Suppose the observed interaction in this case is primarily intrinsic. Explain
why $ms'_{AS}$ is nevertheless a valid error term for testing treatment
effects.


i) Define the population to which the results of the F-test of (c) may be 
generalized on the logic of statistical inference. On what basis might you 
be justified in generalizing to all Iowa schools? 


j) How would you analyze the original data in order to test the significance
of the interaction? Suppose this interaction had been found to be non-significant.
Would this mean that the test based on $F = ms'_A/ms'_{AS}$ was
invalid? Explain.


2. One part of a (hypothetical) experiment (suggested by a study by
Sheffield¹) with albino rats was concerned with the relative effects on resistance
to extinction of varying degrees of reinforcement during training in a
runway situation. The four experimental treatments differed in the percent of
the 32 training trials on which reinforcement (reward) was provided: 25%
for $A_1$, 50% for $A_2$, 75% for $A_3$, and 100% for $A_4$. The pattern of the reinforced
trials was cyclic, subject to the restriction that the first, third, and last
trials were reinforced under every treatment.

Evidence from previous research suggested that the nature and extent of 
previous experience of the rats might affect their performance under these 
experimental conditions. All animals in the available colony were normally 
kept in living cages of four rats each. The previous experiences of the rats 
were much the same for rats in the same cage, but were known to have differed
considerably from one cage to another. However, the exact “experiential 
history” of each cage was not available. Therefore, ten cages were selected at 
random from among all cages in the colony and the four rats in each cage were 
randomly assigned, one to each of the four treatments. 

The apparatus consisted of a four-foot alley connecting a starting box and a 
goal box. Timing devices were arranged so that starting time and running 
time could be automatically recorded. 

After a week of adjustment to a common feeding schedule, all animals were 
given ten exploratory trials in the apparatus. On the following day, each ani- 
mal received 32 training trials, a specified proportion of which were reinforced. 
Reinforcement was provided by a small amount of wet mash and the animal 
was allowed to remain in the goal box for ten seconds. The inter-trial interval 
was held constant at 15 seconds. Immediately following the 32 training trials,
all animals were fed their normal ration minus the amount received during the
training trials.

¹ Virginia F. Sheffield, “Extinction as a Function of Partial Reinforcement and
Distribution of Practice,” Journal of Experimental Psychology, 39: 511-526; August,
1949.

On the third and last day of the experiment, all animals were given 30 
extinction trials, similar in all respects to the training trials except for the 
absence of reinforcement. The response time (sum of starting and running
times) was determined for each animal on each extinction trial. The criterion 
measure for each animal was defined as the number of extinction trials, out of 
the 30, on which its response time was less than the median response time for 
the total group of 40 rats. 

It should be noted that rats tend to be less active in a clean runway than in 
one just used by other rats. In this experiment the runway was cleaned at the 
beginning of each day, but not during the day. 

The criterion measures and some computational results are presented be- 
low: 


Cage      A1 (25%)   A2 (50%)   A3 (75%)   A4 (100%)   Total

R1           10         12         14          9         45
R2           14         22         18         21         75
R3           18         20         21         18         77
R4           20         16         10         17         63
R5           10          9         13         10         42
R6            9         15          9         15         48
R7           15         18         14         11         58
R8           13         17         14         16         60
R9            8         13          9         14         44
R10           9         14         12          7         42

Total       126        156        134        138        554
$M_j$      12.6       15.6       13.4       13.8
$\sum X^2$ 1740       2568       1928       2082


a) Complete the calculations and prepare a summary table of the analysis. 


b) From the information provided, define the population from which these 
10 cages may be regarded as a random sample. 


c) State the hypothesis tested by the ratio $ms_A/ms_{AR}$. May this hypothesis
be rejected at the 5% level? Does this test involve the assumption that 
the four rats in each cage are a random sample from a corresponding 
subpopulation? Explain. 

d) What is presumably the purpose of the 10 exploratory trials? What 
would presumably be done if one rat failed to respond at all during these 
preliminary trials? 


e) The 10 rats under Treatment $A_4$ (100% reinforcement) were given their
extinction trials first, the 10 rats under Treatment $A_1$ (25% reinforcement)
last. Could this result in an “order” effect which partially cancelled
the treatment effects, and hence explain why the F of (c) was not
significant? Explain.


f) How might one randomize the “order” effect? If this were done, and
assuming that there is a real order effect, would $ms_{AR}$ be larger than if the
same order had been employed with each cage? Explain.


g) Suppose that not more than 20 animals could be “run” in a single day,
either for the training or for the extinction trials, and that the experiment
was therefore broken up into two independent parts, the experiment
being completed for 20 of the rats (5 cages randomly selected) before
anything was done with the remaining rats. Suppose also that there was
a marked day effect (which was thus confounded with the R effect) but
assume that there was no days X treatments interaction, and that the
data were analyzed just as in the situation earlier described. Is the test
$F = ms_A/ms_{AR}$ still a valid test of the treatments effect? Explain. (Why
is the assumption of no days X treatments interaction necessary?)


h) Suppose that not more than 8 rats could be run in one day, so that the
experiment had to be broken up into 5 independent parts, each concerned
with 2 randomly selected cages. Suppose also that there was a
marked days X treatments interaction, but that the 5 days involved
could be regarded as a simple random sample from a population of days.
Could the experiment still be regarded as a random replications experiment?
How then should the data be analyzed and the treatments effect
tested? (See pages 163-164.)


i) The criterion measure used in this experiment was selected because it
seemed more likely to be normally distributed than the number of trials
to extinction, which was considered as an alternative. Why is it more
likely to be normally distributed? Is it otherwise as adequate or valid a
criterion as the number of trials to extinction? What generalization does
this suggest concerning the choice of a criterion measure in a psychological
experiment?


j) Would the experiment have been more or less precise if the 40 available
rats had been randomly assigned four to a cage at the beginning of the 
experiment and the same analysis employed? Explain. 


k) Why, presumably, did the experimenter employ four treatments (degrees
of reinforcement) rather than only two or three?




Factorial Designs (Two Factors) 


The Generalized Case of the Two-Factor (A x B) Design 


The basic nature and purposes of the factorial design have already been 
considered in Chapter 1 (pages 20-23), and these should be reviewed carefully 
by the student before he proceeds with this chapter. In that introductory 
discussion, the design was illustrated with an example in which there were only 
two categories in each treatment classification (two styles of type and two sizes 
of type), and in which the number of cases was the same for each treatment 
group. In the generalized case of the two-factor design, there may be any 
number of categories in either treatment classification, and the number of 
cases may differ from cell to cell in the same row or column of the double-entry 
table, but must be in the same proportion from column to column or from row 
to row. 

In order to refer to a specific example, we will extend the illustration used in 
Chapter 1 to include four styles and three sizes of type, so that the design is as 
diagrammed below. 


                              Style of Type
                    A1           A2         A3         A4
                 Clarendon     Roman      Gothic     Italic

 Size    B1 (8 pt.)                                            M1.
  of     B2 (10 pt.)                                           M2.
 Type    B3 (12 pt.)                                           M3.

                    M.1          M.2        M.3        M.4


By a “factor” we mean one of the bases on which the treatments are classified.
In the illustration, style of type is one factor and size the other. The
number of different “treatments” in a two-factor design is, in one sense, equal
to the product of the numbers of categories in the two classifications, or the 
number of cells in the double-entry table. In this sense, there are twelve 
"treatments" in the experiment diagrammed above. In this discussion, how- 
ever, we will use the term “treatments” to represent the various categories in 
each major classification, and will use the term “treatment-combination” to 
refer to the treatments applied to the subjects in a single cell of the table. 
Thus, if there are a columns and b rows there will be a treatments in the 
column classification, b treatments in the row classification, and ab treatment- 
combinations. 

Sometimes the treatments in one or both of the classifications represent 
different degrees or amounts of the same factor, in which case the treatments 
are clearly ordered. Sometimes the treatments are categorically described and 
are non-ordered. In the preceding illustration, style of type is non-ordered, 
but size is ordered. When the treatments in a classification are ordered, as 
when the various treatments represent increasing durations of time for the 
same experimental condition, special interest may be shown in the pattern or
trend of the average scores for those treatments. Such questions may be raised
as “Do the averages fall along a straight line?”, “Is the trend described by a
parabolic curve?”, etc. In this chapter we shall not be concerned with the
trend of the treatment means, but only with whether or not the treatments 
differ at all, that is, with tests of null hypotheses. The problem of trend analy- 
sis will be considered in a later chapter (Chapter 15). 

The terms “main effect," “simple effect," and “interaction effect” have the 
same meanings in a factorial design as in a treatments X levels design (see pages 
122-123). However, in the factorial design there is a main effect for each 
of the two factors, as well as simple effects for each factor at given levels of 
the other factor. 

The computational procedures employed with the factorial design are ex- 
actly the same as those employed with the treatments X levels design (see 
page 123), the categories of the second factor (B) taking the place of levels 
(L). 


The analysis may be presented in table form as follows:

Source               df                  ms

A                    a - 1               $ss_A/(a - 1)$
B                    b - 1               $ss_B/(b - 1)$
Cells                ab - 1
AB                   (a - 1)(b - 1)      $ss_{AB}/(a - 1)(b - 1)$
Within cells (w)     N - ab              $ss_w/(N - ab)$

Total                N - 1



The Test of the AB Interaction and Its Interpretation 


Since the evaluation and interpretation of the main and simple effects de- 
pends on whether or not one may assume that there is no interaction, we shall 
consider first the test of the significance of the interaction and its interpreta- 
tion. In any two-factor design, the significance of the interaction is tested 
exactly as in the treatments X levels design, and on exactly the same as- 
sumptions. These assumptions, given on page 138, should be carefully 
reviewed at this point. If A and B represent the column and row factors re- 
spectively, and a and b represent the numbers of A and B treatments, or the 
numbers of columns and rows, respectively, the test is 


$$F = ms_{AB}/ms_w, \qquad df = (a - 1)(b - 1)/(N - ab), \qquad (56)$$

in which $ms_{AB}$ is the interaction mean square, $ms_w$ the within-cells mean square,
and N is the total number of cases $\left(N = \sum_{j=1}^{a}\sum_{i=1}^{b} n_{ij}\right)$.
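
The computations behind the summary table and the interaction test (56) may be sketched in Python as follows. This is a minimal illustration only: it assumes, for simplicity, the same number of cases n in every cell (the general case of proportional cell frequencies is not handled), and the function name and scores are hypothetical.

```python
def two_factor_anova(cells):
    """Analyze a two-factor design with equal cell sizes.

    cells[i][j] is the list of scores for row treatment B_i and column
    treatment A_j. Returns sums of squares, degrees of freedom, and the
    ratio F = ms_AB / ms_w for the interaction.
    """
    b, a, n = len(cells), len(cells[0]), len(cells[0][0])
    N = a * b * n
    scores = [x for row in cells for cell in row for x in cell]
    correction = sum(scores) ** 2 / N

    ss_total = sum(x * x for x in scores) - correction
    cell_totals = [[sum(cell) for cell in row] for row in cells]
    ss_cells = sum(t * t for row in cell_totals for t in row) / n - correction
    col_totals = [sum(cell_totals[i][j] for i in range(b)) for j in range(a)]
    row_totals = [sum(row) for row in cell_totals]
    ss_a = sum(t * t for t in col_totals) / (b * n) - correction
    ss_b = sum(t * t for t in row_totals) / (a * n) - correction
    ss_ab = ss_cells - ss_a - ss_b   # interaction, by subtraction
    ss_w = ss_total - ss_cells       # within cells

    df_ab, df_w = (a - 1) * (b - 1), N - a * b
    return {"ss_a": ss_a, "ss_b": ss_b, "ss_ab": ss_ab, "ss_w": ss_w,
            "df_ab": df_ab, "df_w": df_w,
            "F_ab": (ss_ab / df_ab) / (ss_w / df_w)}


# Hypothetical scores: 2 row treatments (B) x 2 column treatments (A), n = 2.
scores_by_cell = [
    [[4, 6], [8, 10]],   # B1: A1, A2
    [[5, 7], [3, 5]],    # B2: A1, A2
]
table = two_factor_anova(scores_by_cell)
print(table)
```

The interaction sum of squares is obtained here by subtraction from the between-cells sum of squares, which is why the "Cells" line appears in the summary table even though no mean square is computed for it.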

As we have repeatedly noted previously, an observed interaction may be 
wholly extrinsic (due entirely to Type S and Type G errors), or it may in part 
be intrinsic (due to the treatments only). A significant $F = ms_{AB}/ms_w$ means
only that the observed interaction cannot reasonably be accounted for entirely
in terms of random Type S errors, but it does not rule out the possibility that
the interaction is wholly extrinsic. What weight should be given to this possibility
depends on the extent to which the Type G errors have been controlled
or equalized for all treatment combinations in the experiment. The interpretation
of a significant $F = ms_{AB}/ms_w$ therefore calls for a judgment on the part of
the interpreter concerning the effectiveness of the experimental error controls. 

Sometimes it is quite evident that Type G errors are negligible. In the 
illustrative experiment, for example, all twelve treatment-combinations may 
have been administered simultaneously and under exactly the same conditions 
to different members of the same group. The reading-rate test used may have 
been printed in twelve editions, one for each size-style combination, and may 
be alike in all other respects. The tests may have been administered in a room 
large enough to permit testing all subjects simultaneously in a single group. 
The tests may have been passed out in repeated sequence, 1-2-3- . . . 12, so 
that if the subjects are seated in a random order, a random twelfth of the 
group takes each test. In this situation it is very difficult to conceive of any 
extraneous factor or circumstance that would systematically affect any one 
treatment-combination differently from any other. Experimental errors may 
occur; for instance, an unanticipated distraction may arise during the testing 
period, but presumably it would affect all treatment-combinations alike, and 
could hardly result in any interaction effect. In such an experiment, the 
observed interaction would presumably be due only to Type S errors or to an 
intrinsic interaction, or both. If the test of significance shows that Type S
errors could not reasonably account for all of the observed interaction, the 
inference would be clear that an intrinsic interaction is present. 


210 FACTORIAL DESIGNS (TWO FACTORS) 


This illustration, however, is by no means typical of factorial experiments in 
general. Most often, each treatment-combination must be separately admin- 
istered, frequently each at a different time and sometimes by different individ- 
uals. In such cases, quite obviously, extraneous factors could often systemati- 
cally affect some treatment-combinations and not others; these Type G errors 
could then give rise to a large observed interaction. 

Assuming that Type G errors are negligible, a significant $F = ms_{AB}/ms_w$
means that there is an intrinsic interaction, or in general that the differences 
among the population means of the treatment-combinations in one row (or 
column) of the table are not necessarily the same as the corresponding differ- 
ences in any other row (or column). In the illustration, for instance, the style 
of type which produces the fastest reading rate in combination with one size of 
type may not be that which produces the fastest reading rate in combination 
with another size of type. 
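
The meaning of an intrinsic interaction can be made concrete with a small sketch. The code below assumes equal numbers in all cells and works on hypothetical population means (the two tables and the function name `interaction_effects` are illustrative, not taken from the text). It computes, for each cell, the part of the cell mean not accounted for by its row and column means; these quantities are all zero exactly when the differences among the means in one row equal the corresponding differences in every other row.

```python
def interaction_effects(cell_means):
    """Return m_ij - m_i. - m_.j + m.. for a table of cell means.

    Rows are B-treatments, columns are A-treatments; equal n per cell is assumed,
    so unweighted row and column means are used.
    """
    b = len(cell_means)
    a = len(cell_means[0])
    row_means = [sum(row) / a for row in cell_means]
    col_means = [sum(cell_means[i][j] for i in range(b)) / b for j in range(a)]
    grand = sum(row_means) / b
    return [[cell_means[i][j] - row_means[i] - col_means[j] + grand
             for j in range(a)] for i in range(b)]

# Hypothetical population means: three styles (rows) by four sizes (columns).
additive = [[10, 12, 14, 13],
            [11, 13, 15, 14],
            [9, 11, 13, 12]]
crossed = [[10, 12, 14, 13],
           [11, 13, 15, 14],
           [13, 11, 9, 12]]   # the third style ranks the sizes differently

no_int = interaction_effects(additive)
some_int = interaction_effects(crossed)
print(max(abs(x) for row in no_int for x in row))    # no interaction
print(max(abs(x) for row in some_int for x in row))  # interaction present
```

In the second table the size that is best with the first two styles is the worst with the third, which is the situation described in the type-size illustration.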

The possibility should always be considered that the interaction is heterogeneous.
For instance, differences among the effects of corresponding A-treatments
may be very nearly the same for $B_1$ as for $B_2$, but may differ considerably
from $B_1$ to $B_3$. It is conceivable that the interaction in a particular
experiment is due entirely to the effect of just one of the treatments in either
classification. It is possible, for example, that if Row 3 ($B_3$) were removed
from the table, the rest of the table would show no interaction at all. It is 
even conceivable that the interaction is due entirely to the effect of just one 
treatment-combination, and that if the criterion mean could be appropriately 
changed for just one cell in the table, there would be no interaction. Such 
outcomes are relatively unlikely, particularly if the various treatments in each 
classification represent different degrees or amounts of the same factor, but 
they are always possibilities. It should be noted that the test $F = ms_{AB}/ms_w$
is not very sensitive to an interaction which characterizes only a small part of 
the entire table. The fact that the total interaction proves non-significant, 
therefore, does not rule out the possibility that such an interaction exists, but 
it does render dubious any attempts to identify such interactions from closer 
inspection of the experimental data (see pages 48-49).

The fact that an interaction is “significant” does not necessarily imply that 
it is of much practical importance. If the precision of the experiment is high — 
that is, if the mean for each treatment-combination is based on a very large 
sample and has a small standard error, a relatively weak interaction may 
prove significant. If the precision of the experiment is low, even a very potent 
interaction may fail to prove significant. The implications of these possibili- 
ties will be considered more fully in later sections. 

If the interaction proves non-significant and if other considerations permit, 
one is free to assume that there is no interaction so far as the entire population 
is concerned (although this has by no means been proved). This is equiva- 
lent to assuming that the simple effect of each treatment in either classifica- 
tion is the same for all levels of the other classification. In a sense, the main 
effect of a treatment or factor is simply the (weighted) average of all its simple
effects; or each simple effect is simply an estimate of the main effect. On the
assumption of no interaction, therefore, our interest would be entirely in the 
more stable main effects; there would be no point at all in inquiring into the 
simple effects. 

A significant interaction, on the other hand, suggests that the differences 
among the simple effects of the treatments in one classification may be differ- 
ent for each treatment (in the other classification) with which they may be 
combined. In this case, the main effect of any treatment or factor still has a 
definite meaning, but ordinarily a much less useful one than when the interaction
is negligible or nonexistent. Suppose, for instance, that in our illustrative
experiment the interaction had proved significant, but that a comparison
was nevertheless made between the main effects of $B_1$ and $B_3$. Then suppose
it was found that $(M_{B_1} - M_{B_3})$ is significant (using a test to be described later).
For what population may any inferences be drawn on the basis of this significant
difference? If the numbers in all columns are the same, one-fourth of the
“population” in this case consists of individuals who received Treatment $A_1$,
one-fourth received $A_2$, one-fourth $A_3$, and one-fourth $A_4$. Quite obviously,
there would in general be very little practical interest in any such population. 
The interest rather would be in the possibility, for instance, that Clarendon 
type is better than Gothic when used with one size of type, that these styles 
are about equally good when used with another size of type, but that Gothic 
may be superior to Clarendon for still another size of type. In this case, also, 
the interest would be in what combination of size and style results in the most 
rapid reading, rather than in which size is on the average best for all styles, 
each style receiving the same amount of use, or in which style is best with all 
sizes, each size receiving the same amount of use. 

We may then summarize the preceding discussion as follows. If the inter- 
action is not significant and if other considerations permit us to assume that 
there is no interaction (either intrinsic or extrinsic), we will have no interest in
“simple” effects, since the simple effects for either factor are presumably the 
same (except for chance) for all levels of the other factor. In this case, we 
would be interested only in the main effects, and any conclusions drawn about 
the treatments in one classification would be equally applicable at all levels of 
the other classification. If, on the other hand, the observed interaction proves 
significant, and if we may assume that there is an intrinsic interaction, we 
would ordinarily have very little interest in any main effects, but would center 
our attention on the simple effects or on individual treatment combinations. 
We thus see that it is usually desirable to test the significance of the interaction
before proceeding with any other tests.


Testing the Significance of the Main Effect of Either Factor 


The main effects of factors A and B respectively are tested by

$F = ms_A/ms_w$, df = (a - 1)/(N - ab), (57)

and

$F = ms_B/ms_w$, df = (b - 1)/(N - ab). (58)




The conditions¹ under which these mean square ratios are distributed as F
are listed below. The student will note that there is no essential difference
between these conditions and those listed on pages 133-134 for the treatments
× levels designs.


1) The subgroups receiving the various treatment-combinations were, at 
the beginning of the experiment, all drawn at random from the same 
population (the available subjects have been randomly assigned to the 
treatment subgroups). 


2) The distribution of criterion measures for the subpopulation correspond- 
ing to each treatment-combination is a normal distribution. 


3) Each of these distributions has the same variance. 


4) The numbers in the corresponding treatment subgroups (cells) are in the 
same proportion from row to row or from column to column of the table. 


5) a) The population mean for each A-treatment (column) is the same. 


b) The population mean for each B-treatment (row) is the same. 


Which version of Condition 5 is to be employed depends of course on whether
the main effect for A- or B-treatments is being tested. 

The proof that under these conditions the mean square ratios in (57) and 
(58) are distributed as F is exactly the same as that presented on pages 134-135.
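
The tests of (57) and (58), together with the interaction test, can be sketched in a few lines. This is a minimal illustration only: it assumes the same number of cases in every cell (a special case of Condition 4), and the function name `two_factor_anova` and the data are hypothetical. The F ratios would still have to be referred to an F-table with the degrees of freedom shown.

```python
def two_factor_anova(cells):
    """cells[i][j] is the list of criterion measures for row B_i, column A_j.

    Assumes the same number of cases n in every cell.
    """
    b, a = len(cells), len(cells[0])
    n = len(cells[0][0])
    N = a * b * n
    grand = sum(x for row in cells for cell in row for x in cell) / N
    cell_mean = [[sum(c) / n for c in row] for row in cells]
    row_mean = [sum(r) / a for r in cell_mean]
    col_mean = [sum(cell_mean[i][j] for i in range(b)) / b for j in range(a)]

    ss_A = b * n * sum((m - grand) ** 2 for m in col_mean)
    ss_B = a * n * sum((m - grand) ** 2 for m in row_mean)
    ss_cells = n * sum((cell_mean[i][j] - grand) ** 2
                       for i in range(b) for j in range(a))
    ss_AB = ss_cells - ss_A - ss_B
    ss_w = sum((x - cell_mean[i][j]) ** 2
               for i in range(b) for j in range(a) for x in cells[i][j])

    df_A, df_B, df_AB, df_w = a - 1, b - 1, (a - 1) * (b - 1), N - a * b
    ms_w = ss_w / df_w
    return {
        "F_A": (ss_A / df_A) / ms_w, "df_A": (df_A, df_w),      # test (57)
        "F_B": (ss_B / df_B) / ms_w, "df_B": (df_B, df_w),      # test (58)
        "F_AB": (ss_AB / df_AB) / ms_w, "df_AB": (df_AB, df_w),
    }

# Hypothetical 2 x 2 experiment with two cases per cell.
res = two_factor_anova([[[1, 3], [5, 7]],
                        [[2, 4], [6, 8]]])
print(res)
```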

In any application of this design, careful consideration should be given to
the extent to which the underlying assumptions (Conditions 1 to 3) are satisfied
in the experiment.

For reasons already discussed, it is frequently impossible for the experimenter
to draw the samples for the various treatment combinations strictly at
random from the real population in which he may be interested. In nearly 
every application, however, it is possible for him to randomize the available 
subjects with reference to cells. Hence, in practice, Condition 1 usually means 
random assignment of subjects to subgroups. In this case, of course, the 
population to which inferences may be drawn must be defined to fit the sample 
of available subjects. 

The important considerations, so far as Conditions 2 and 3 are concerned, 
are the same as those discussed in previous chapters and need not be repeated 
here. 

Condition 4 is nearly always wholly within the control of the experimenter; 
hence, it should ordinarily cause no concern for the validity of the F-test. 

If any of these conditions are only approximately satisfied, the ratios of the
mean squares will be distributed only approximately as in the F-distribution,
and the usual effect will be to make the ratio of the mean squares appear to be
more “significant” than it really is. If there is any serious doubt concerning
Condition 3, it may be well to apply the Bartlett test before proceeding with
the analysis.

¹ See footnote on page 51.

The “population” referred to in 5a is that for which the measures in any 
column may be regarded as a representative sample with reference to the B-categories.
For instance, the population mean for Treatment $A_1$ is the mean
of a population all members of which receive Treatment $A_1$, but in which a
certain proportion receive Treatment $B_1$ in combination with $A_1$, another proportion
Treatment $B_2$ in combination with $A_1$, etc., the proportions being
exactly the same as in the experimental sample. A similar statement may be 
made about the population referred to in 5b. For reasons already suggested, 
there would ordinarily be very little interest in any such populations, since 
only very rarely would one propose in practice to administer any treatment to 
any such population. 

If corresponding differences among the A means are the same for all levels of 
B, it follows that if the null hypothesis concerning the A means is true for one 
level of B, it is true for every other. So far as the test of the null hypothesis 
(5a) is concerned then, the distribution among B levels in the population is of 
no importance. Assuming no interaction, the F-test of (57) may be regarded 
as a test of the null hypothesis for any or all of the B levels. 


Testing the Simple Effects of Either Factor 


The simple effect of the A factor for a given level, i, of B, is tested by

$F = ms_{A \text{ for } B_i}/ms_w$, df = (a - 1)/(N - ab), (59)

or by

$F = ms_{A \text{ for } B_i}/ms_{w \text{ (for row } B_i \text{ only)}}$, df = (a - 1)/($n_{i\cdot}$ - a). (60)

In both expressions above, $ms_{A \text{ for } B_i}$ is the mean square for between-treatments
computed only from the data in the ith row of the table, as in a simple-randomized
design. In (59), $ms_w$ is computed from the entire table, while in (60),
$ms_{w \text{ (for row } B_i \text{ only)}}$ is computed for within-cells from only the data in Row $B_i$.
The second of these tests (60) would be employed if the assumption of homogeneity
of variance were regarded as tenable for Row $B_i$ alone, but untenable
for the entire table.
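
The two versions of the simple-effect test differ only in the error term, which the following sketch tries to make explicit. The function name `simple_effect_of_A`, its `pooled` switch, and the data are assumptions made for illustration; `cells[i][j]` holds the measures for row $B_i$ and column $A_j$.

```python
def simple_effect_of_A(cells, i, pooled=True):
    """F ratio and df for the simple effect of A at row B_i.

    pooled=True uses ms_w from the entire table, as in (59);
    pooled=False uses ms_w from row B_i only, as in (60).
    """
    a = len(cells[0])
    row = cells[i]
    n_i = sum(len(c) for c in row)
    row_mean = sum(x for c in row for x in c) / n_i
    # Between-treatments mean square from the ith row alone.
    ss_between = sum(len(c) * (sum(c) / len(c) - row_mean) ** 2 for c in row)
    ms_between = ss_between / (a - 1)

    def within(rows):
        """Within-cells sum of squares and df for the given rows."""
        ss = sum((x - sum(c) / len(c)) ** 2 for r in rows for c in r for x in c)
        cases = sum(len(c) for r in rows for c in r)
        n_cells = sum(len(r) for r in rows)
        return ss, cases - n_cells

    ss_w, df_w = within(cells) if pooled else within([row])
    return ms_between / (ss_w / df_w), (a - 1, df_w)

# Hypothetical 2 x 2 table with two cases per cell.
table = [[[1, 3], [5, 7]],
         [[2, 4], [6, 8]]]
F_59, df_59 = simple_effect_of_A(table, 0, pooled=True)
F_60, df_60 = simple_effect_of_A(table, 0, pooled=False)
print(F_59, df_59)
print(F_60, df_60)
```

With the pooled error term the denominator has N - ab degrees of freedom; with the row-only error term it has only $n_{i\cdot}$ - a, which is the price paid for not assuming homogeneity over the whole table.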

The conditions under which these mean square ratios are distributed as F 
are the same as those listed on page 73, these conditions having to be satisfied 
only for the part of the table involved in the test. It will be noted that the 
test of a simple effect of either factor is of exactly the same character as the 
test of the treatments effect in a simple-randomized design. That is, so far as 
the A-comparisons are concerned, the factorial design may be regarded as a 
number of independent simple-randomized experiments, one conducted for 
each level of B. A similar statement may be made, of course, with reference to
the B-comparisons.




The simple effect of the B factor for level $A_j$ of A is tested by

$F = ms_{B \text{ for } A_j}/ms_w$, df = (b - 1)/(N - ab), (61)

or by

$F = ms_{B \text{ for } A_j}/ms_{w \text{ (for col. } A_j \text{ only)}}$, df = (b - 1)/($n_{\cdot j}$ - b). (62)


Individual Comparisons of Row, Column, or Cell Means 


The estimated error variance of any cell, row, or column mean is given
respectively by

$\sigma^2_{M_{A_jB_i}} = ms_w/n_{ij}$

$\sigma^2_{M_{B_i}} = ms_w/n_{i\cdot}$

$\sigma^2_{M_{A_j}} = ms_w/n_{\cdot j}$


Given these error variances, the usual procedures may be followed to test 
the significance of the difference between any two means. The number of de- 
grees of freedom for each of the above error variances is (N — ab). Each of 
these error variances is based on the assumption of homogeneous error vari- 
ance throughout the entire table. If this assumption is in serious doubt except 
for the two means compared, the error variance may be computed only for the 
data on which the given means are based. 
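
As a hedged illustration of the "usual procedures," the sketch below forms the t ratio for the difference between two independent means (two cells, two rows, or two columns) from the error variances just given. The function name and the numerical values are hypothetical.

```python
import math

def t_for_difference(mean_1, n_1, mean_2, n_2, ms_w):
    """t ratio for the difference between two independent means whose
    estimated error variances are ms_w / n_1 and ms_w / n_2."""
    error_variance = ms_w / n_1 + ms_w / n_2
    return (mean_1 - mean_2) / math.sqrt(error_variance)

# Hypothetical values; the result is referred to the t-table with N - ab
# degrees of freedom when ms_w comes from the entire table.
t = t_for_difference(mean_1=14.0, n_1=10, mean_2=11.0, n_2=10, ms_w=12.5)
print(round(t, 3))
```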


The Meaning of $ms_{A\ (\text{or } B)}/ms_{AB}$ When the Interaction Is Significant


If there is an intrinsic interaction between A and B for a population, there 
is no meaningful hypothesis under which the ratio $ms_A/ms_{AB}$ or $ms_B/ms_{AB}$ is
distributed as F. This is because the categories in neither the A nor B classi- 
fications may be regarded as randomly selected from a population of such cat- 
egories, so that the interaction effects may not be regarded as random effects. 
Nevertheless, if the precision of the experiment is high, that is, if the total 
number of subjects per treatment-combination is large, these ratios do have 
a useful meaning (although not a probability meaning). 

The fact that the interaction is significant and intrinsic does not necessarily 
imply that the true rank order (that is, the order based on the population 
means) of the A (or B) treatments is different for each level of B (or A). Even 
though the interaction is intrinsic and pronounced, it is possible that the main 
effect of A is so much more pronounced than the interaction effect that the 
rank order of the A treatments is the same, or approximately the same, for each 
level of B (and likewise for the B treatments). By a reasoning similar to that 
presented on pages 141-144, the following rule-of-thumb may be justified. If
$ms_A/ms_{AB}$ is greater than 4b, one may fairly safely conclude that the true
rank order of the A treatments is approximately the same for all levels of
B, even though the differences among the true treatment means for one level
of B differ somewhat from the corresponding differences for any other level of
B. Similarly, if $ms_B/ms_{AB}$ is greater than 4a, the true rank order of the B
treatments is probably approximately the same for all levels of A.
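
The rule of thumb amounts to a single comparison, sketched below. The function name and the mean squares are hypothetical; the only thing taken from the text is the criterion itself (the ratio of the main-effect mean square to the interaction mean square must exceed four times the number of levels of the other factor).

```python
def rank_order_probably_stable(ms_effect, ms_ab, levels_of_other_factor):
    """Rule of thumb: for factor A pass b; for factor B pass a."""
    return ms_effect / ms_ab > 4 * levels_of_other_factor

# Hypothetical mean squares for a design with a = 4 and b = 3.
a_stable = rank_order_probably_stable(ms_effect=260.0, ms_ab=15.0,
                                      levels_of_other_factor=3)  # 17.3 vs 12
b_stable = rank_order_probably_stable(ms_effect=90.0, ms_ab=15.0,
                                      levels_of_other_factor=4)  # 6.0 vs 16
print(a_stable, b_stable)
```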

As is true of any rule-of-thumb, these rules should be used with considerable 
caution. It should be noted particularly that their validity depends on the 
precision of the means for individual treatment-combinations, and on the ex- 
tent to which Type G errors are negligible. 


The Conditions Under Which $ms_{A\ (\text{or } B)}/ms_{AB}$ Is Distributed as F


If there is no intrinsic interaction and if Type G errors have been independently
randomized for each level of B (or A) and if all the other conditions listed
on page 212 are met, with the additional requirement of equal n’s in all cells,
$ms_{A\ (\text{or } B)}/ms_{AB}$ is distributed as F. In this case, we may test the significance of
the treatment effects by

$F = ms_A/ms_{AB}$, df = (a - 1)/(a - 1)(b - 1), (63)

or

$F = ms_B/ms_{AB}$, df = (b - 1)/(a - 1)(b - 1). (64)


Since there is no intrinsic interaction, the interaction effects for the b rows (or 
a columns) of the table may be regarded as a simple random sample from a 
hypothetical population of such interaction effects for an indefinite number of 
rows (or columns). 

The trouble with the preceding observation is that if the observed inter- 
action is significant there is no objective or statistical basis upon which one 
may determine whether the interaction is in part intrinsic or is entirely extrin- 
sic. Hence, the use of the preceding F-test must often be based upon the 
relatively unsupported assumption that there is no intrinsic interaction. Sometimes
the nature of the treatments may be such that this assumption is extremely
plausible, but there is always the possibility that it is false in spite of
its plausibility. In general, as will be made clear later, an interaction mean 
square should be used as an error term only if it is quite definitely known that 
the observed interaction effects are, or may be regarded as, a simple random 
sample from a meaningful population of such effects. Even then, of course, it 
should be used as an error term only if the interaction is homogeneous and if 
the numbers of cases are the same for all cells of the table, or if the interaction 
mean square is based on an analysis of the unweighted cell means. 


How to Make Comprehensive Tests of Significance When the Interaction 
Is in Part Intrinsic and in Part Due to Randomized Type G Errors 
We have seen that the test $F = ms_A/ms_w$ (or $F = ms_B/ms_w$) takes into
consideration random Type S errors only. We have seen also that if Type G
errors have been randomized for each row independently and if $n_{ij} = n$ is constant,
we may, on the assumption of no intrinsic interaction (usually a questionable
assumption), test the significance of the main effect of A by means of
$F = ms_A/ms_{AB}$ (and similarly for the main effect of B). In this case, the test
of significance takes both Type S and Type G errors into consideration. In 
the event that the interaction is in part real, however, we have, for the simple 
factorial design, no way of testing the significance of a main effect that takes 
both Type S and Type G errors into consideration. 

The latter is one of the situations to which a solution is provided by the 
principle of random replication (see page 201). If a two-factor experiment is 
being planned, if there is reason to believe that an intrinsic interaction may 
exist, and if Type G errors can be randomized independently in each replica- 
tion, the thing to do is to provide for a number of random replications of the 
experiment. That is, the whole experiment should be repeated a number of 
times, each time with an independent random sample of subjects and with an 
independent randomization of Type G errors. Tests of significance valid both 
for Type S and Type G errors can then be made, not only for the main effect of 
either factor, but also for the interaction and for the simple effects. The man- 
ner in which such a design may be analyzed will be considered in the next 
chapter (pages 230-237). It is particularly important to note that by thus 
replicating a simple two-factor design, it is possible to test the hypothesis that 
there is no intrinsic interaction — an hypothesis it is impossible to test in a 
single replication of a two-factor design. 


The Use of Transformations 


The use of transformations with the two-factor factorial design presents essentially
the same problems as with the treatments × levels design (see pages
149-151), the difference being that the second treatment factor (B) takes the
place of the control factor (L). As in the case of the treatments × levels design,
if there is an interaction on the transformed scale, the hypothesis that the means 
of the treatment populations are identical cannot be true with reference both 
to the original and the transformed measures. Unless the main effect of one of 
the treatments is nil, the hypothesis of no interaction cannot be true for both 
the original and transformed measures. If one wishes to test the hypothesis of 
no interaction on the original scale, one must determine what is the equivalent 
hypothesis on the transformed scale and then test this hypothesis, rather than 
test the hypothesis of no interaction on the transformed scale. 
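
The dependence of interaction on the scale of measurement can be seen in a small numerical sketch. It is a simplification: each cell is treated as a single error-free value rather than as the mean of a distribution, the values are hypothetical, and the logarithmic transformation is chosen only for illustration. A table that shows no interaction on the logarithmic scale shows a clear interaction on the original scale, since both factors have non-zero main effects.

```python
import math

def max_interaction(table):
    """Largest absolute value of m_ij - m_i. - m_.j + m.. in the table."""
    b, a = len(table), len(table[0])
    row = [sum(r) / a for r in table]
    col = [sum(table[i][j] for i in range(b)) / b for j in range(a)]
    grand = sum(row) / b
    return max(abs(table[i][j] - row[i] - col[j] + grand)
               for i in range(b) for j in range(a))

# Hypothetical error-free values: each cell is (row effect) x (column effect).
original = [[r * c for c in (1.0, 2.0, 4.0)] for r in (10.0, 20.0)]
logged = [[math.log(x) for x in r] for r in original]

on_log_scale = max_interaction(logged)         # additive on the log scale
on_original_scale = max_interaction(original)  # not additive on the original scale
print(on_log_scale, on_original_scale)
```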


STUDY EXERCISES¹


1. Teichner² carried out an experiment concerned with the effect of intertrial
time on the acquisition and extinction of an instrumental response in
white rats. Three groups of hooded male rats were selected at random from
the available colony and each group was randomly divided into four subgroups
of equal size. The following diagram indicates the sizes of the groups and subgroups,
as well as the intertrial times used with each.

¹ See second paragraph on page viii.
² Warren H. Teichner, Experimental Extinction as a Function of the Intertrial Time
During Conditioning and Extinction, Ph.D. Thesis, State University of Iowa, February, 1951.


                              Intertrial Time in Seconds

                           Acquisition      Extinction
Group I    (N1 = 40)           30          30  45  60  90    (n1 = 10 each)
Group II   (N2 = 100)          45          30  45  60  90    (n2 = 25 each)
Group III  (N3 = 40)           90          30  45  60  90    (n3 = 10 each)


A side view of the problem box used in the experiment is shown below:

[Figure: side view of the problem box, with the light, guillotine door, and food tray labeled.]

The animals, after a week of adjustment to handling and to a 23-hour feeding
schedule, were each given a 10-minute exploratory session in the apparatus.
On the following day, each animal in turn was given
two preliminary trials designed to assure movement toward the food tray 
during the learning series, followed immediately by 15 training trials under 
the appropriate intertrial time condition. On each trial, latency was auto- 
matically recorded as the time between raising of the door and depression of 
the tray to obtain a pellet. As soon as the pellet was obtained, the tray 
returned to normal position, the door was lowered and the rat was allowed to 
eat. A latency of 10 seconds or more was classed as no response — the door 
was automatically lowered after this period. 

Extinction trials were started immediately after the last training trial, 
and were exactly like the training trials except that no food was presented.
Extinction trials were continued until the animal failed to respond on three 
successive trials. The criterion measure for each animal was the number of 
responses during the extinction series. The data in the body of the table im- 
mediately following are the criterion means for each subgroup. 




                                     Intertrial Interval During Extinction (E)
                                     30 sec.   45 sec.   60 sec.   90 sec.   Means
Intertrial Interval       30 sec.
During Acquisition (A)    45 sec.
                          90 sec.
                          Means




Summary Table


Source              df        ss         ms
Acquisition (A)      2      275.66     137.83
Extinction (E)       3       64.95      21.65
Interaction (AE)     6       72.20      12.03
Within-Cells (w)   168     2236.52      13.31
Total              179     2649.33
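
The arithmetic of the summary table can be checked with a short sketch. The sums of squares for A, E, within-cells, and total are taken from the table; the interaction sum of squares is obtained here by subtraction from the total, which gives 72.20 and agrees with the tabled mean square of 12.03 on 6 degrees of freedom. The variable names are illustrative only.

```python
df = {"A": 2, "E": 3, "AE": 6, "w": 168}
ss = {"A": 275.66, "E": 64.95, "w": 2236.52}
ss_total = 2649.33

# Interaction sum of squares by subtraction from the total.
ss["AE"] = round(ss_total - ss["A"] - ss["E"] - ss["w"], 2)

# Mean squares and F ratios, each against the within-cells mean square.
ms = {k: ss[k] / df[k] for k in df}
F = {k: ms[k] / ms["w"] for k in ("A", "E", "AE")}

for k in ("A", "E", "AE"):
    print(f"{k:>2}: ss = {ss[k]:8.2f}  ms = {ms[k]:7.2f}  "
          f"F = {F[k]:5.2f}  df = {df[k]}/{df['w']}")
```

The degrees of freedom add to 179, one less than the 180 animals (40 + 100 + 40) in the experiment.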


a) May the hypothesis of no interaction be rejected at the 5% level? Must 
one assume in applying this test that the three original groups of rats 
were random samples from the same population? What assumption 
concerning sampling is involved in the test of interaction? To interpret 
the results of this test satisfactorily, is it necessary to assume that the 
three groups are random samples from the same population? Explain. 


b) In the test of the hypothesis of no interaction, to what distributions do 
the assumptions of normality and homogeneity of variance apply? In 
view of the criterion measure employed, have you any a priori reason to 
question these assumptions? In other experiments, what has usually 
been the form of the distribution of trials to extinction? Is there usually a 
tendency for the variances and the means of such distributions to be 
related? How would you justify the usual F-test in spite of the failure to 
satisfy the underlying assumptions? How would you modify the test, if 
at all? 


c) The F obtained in (a) was not significant, yet the facts of generalization 
established in other experiments lead one to expect an intrinsic inter- 
action. How may one reconcile the findings with the hypothesis that 
there is an interaction? 


d) In view of the strong a priori case for interaction, would you employ a 
higher or lower level of significance in the test of (a) than if there were no 
such a priori considerations? Explain.


e) On the assumption that there is no interaction, how, in general, would 
one proceed with the analysis of acquisition and extinction effects? How 
would one proceed on the assumption that there is an intrinsic inter- 
action? 


f) Make a free-hand graph, for each acquisition interval, of the criterion 
means as a function of extinction intervals (on the abscissa). How can 
the hypothesis of no interaction be interpreted in terms of these graphs?
A non-significant interaction is consistent with the hypothesis that if
these graphs were based on population values, they would exhibit what 
relationship to one another? 


g) Test the simple effect of the E factor (extinction intervals) for the 30-second
acquisition interval. Define the population involved. Why is
$ms_w$ for the whole experiment undesirable for use as an error term here?
What is the most appropriate error term for testing $M_{11} - M_{12}$? (The
first subscript identifies the A-category, the second the E-category.) 


h) The ratio $ms_A/ms_w$ = 10.36 is significant at the 1% level, according to
the F-table. Why might the risk of a Type I error nevertheless be larger
than 1 in 100? According to the results of the Norton study, one may be 
fairly certain that the true risk of a Type I error does not exceed what 
value? 


i) How do the conditions on which $ms_A/ms_w$ is distributed as F differ from
those on which $ms_{AB}/ms_w$ is distributed as F?


j) Which of the following inferences (for the given experimental situation)
are justified by the outcome of the test in (h)? Which can be justified 
only depending on the outcome of other tests in addition to that of (h)? 
Explain in each case. Suggest, where necessary, changes or qualifica- 
tions that will make each statement a reasonable statistical inference 
from the data of this experiment. 


1. The longer the intertrial interval during acquisition, the greater is 
the average resistance to extinction. 


2. The average number of responses for all extinction intervals needed 
to extinguish is not the same for 30-, 45-, and 90-second acquisition 
intervals. 


3. Regardless of what extinction interval is used, a 45-second acquisi- 
tion interval produces greater average resistance to extinction than 
does a 30-second acquisition interval. 


4. The observed differences among the average number of trials 
needed for extinction for the various A-E combinations (cells) 
cannot be attributed to chance assignment of rats to subgroups. 


k) What assumptions would be necessary if one wished to extend the conclusions
from this experiment to rats in general? Is the use of only male
rats a serious limitation of this experiment?


l) What are some of the advantages of the factorial design as used in this
experiment?


Three-Dimensional Designs 


Introduction 


A factorial experiment involving two factors (A × B) may itself be replicated
in various ways. In some instances, the various replications may be
regarded as a random sample from a population of such replications. This
design, which was suggested at the close of the preceding chapter, might be
denoted an A × B × R design. In other instances, each replication may be
performed at a different level of some control variable; this might be called a
treatments × treatments × levels (A × B × L) experiment. The replications
may also represent different levels of a third experimental variable (C) and all
replications together may thus constitute a higher order factorial experiment,
in this case involving three factors (A × B × C). In still other instances, a
treatments × levels experiment (in which A represents treatments and L
levels) may be replicated on a random basis (A × L × R).

The data from such experiments may always be entered in a triple-entry
or three-dimensional table. This table may be geometrically represented as a
parallelepiped, consisting of s layers parallel to the front face (or r layers
parallel to the top face, or c layers parallel to the side face). The accompanying
figure illustrates one such parallelepiped in which r = 3, c = 6, and s = 4.


Analysis of Total Sum of Squares 


Any parallelepiped of the type suggested above may be regarded as consisting
of rc smaller elongated parallelepipeds or oblongs, each ending in one of the
rc cells in the front (or back) face. One such oblong is shown in the figure
below — that ending in the ith cell in the jth column of the front (RC) face. 



Each such oblong will consist of s cells, one for each of the s-layers. The
parallelepiped may similarly be regarded as consisting of rs oblongs running
from side to side (ending in the RS face) or of cs oblongs running from
top to bottom (ending in the CS face). We will use the notation obs
RC to represent oblongs ending in the RC face, obs RS for those ending in
the RS face, etc.

[Figure: the parallelepiped, showing one oblong (ob) and cell ijk, with a layer i and the R and S dimensions labeled.]

In the notation for this design, the first subscript will always refer to a
category in the R classification (a row in the front face of the parallelepiped),
the second subscript to a C category (column in the front face), and the
third subscript to an S category. Thus, $M_{ijk}$ will represent the mean of the cell
which lies in the ith R category, the jth C category, and the kth S category. A
dot in the subscript will indicate that all categories in the corresponding classification
are represented. Thus, $n_{\cdot j\cdot}$ represents the number of cases in the jth
category of C (all cases in the jth layer in the accompanying figure) including
all R and S categories represented in that C category; $M_{\cdot jk}$ represents the
mean of the oblong formed by the intersection of the jth C layer and the kth S
layer; etc. A term with no subscripts applies to the entire table. Thus, T
represents the grand total for the entire table.

If the numbers in the corresponding cells are in the same proportion for any 
two oblongs in the same layer, and if the numbers in corresponding oblongs are 
proportional for all layers, the total sum of squares may then be analyzed into 
a number of independent components in the manner explained in the re- 
mainder of this section. 

Let us first disregard the S classification completely, and regard the parallele- 
piped as consisting of obs RC ending in the RC face. Let us then think of the 
data for each oblong as projected into the RC face, so that all of the data are 
represented in an R × C table. In this table we may then analyze the total
sum of squares into

$ss_T = ss_R + ss_C + ss_{RC} + ss_{w\ \text{obs}\ RC}$, (65)


the last term being the sum of squares for within-obs RC.
Similarly, by regarding the parallelepiped as consisting of obs RS, we could
write

$ss_T = ss_R + ss_S + ss_{RS} + ss_{w\ \text{obs}\ RS}$, (66)

and in the same manner

$ss_T = ss_C + ss_S + ss_{CS} + ss_{w\ \text{obs}\ CS}$. (67)

Finally, considering the parallelepiped as consisting of rcs cubicles or cells,
we may write

$ss_T = ss_{\text{cells}} + ss_w$, (68)

$ss_w$ representing the within-cells sum of squares.
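
The identities (65) and (68) can be checked numerically on a small table. The sketch below assumes equal numbers in all cells (so that the proportionality condition holds) and uses hypothetical measures for a 2 × 2 × 2 table with two cases per cell; the helper `groups` and all variable names are illustrative.

```python
# Hypothetical measures: key (i, j, k) is the R, C, S category of the cell.
measures = {
    (0, 0, 0): [3, 5], (0, 0, 1): [4, 8],
    (0, 1, 0): [6, 7], (0, 1, 1): [9, 9],
    (1, 0, 0): [2, 6], (1, 0, 1): [5, 5],
    (1, 1, 0): [8, 10], (1, 1, 1): [7, 12],
}
cases = [(i, j, k, x) for (i, j, k), cell in measures.items() for x in cell]
grand = sum(x for *_, x in cases) / len(cases)

def groups(key):
    """Map each group label to (number of cases, mean) for groups formed by key(i, j, k)."""
    out = {}
    for i, j, k, x in cases:
        out.setdefault(key(i, j, k), []).append(x)
    return {g: (len(v), sum(v) / len(v)) for g, v in out.items()}

R = groups(lambda i, j, k: i)
C = groups(lambda i, j, k: j)
RC = groups(lambda i, j, k: (i, j))          # the rc oblongs ending in the RC face
cells = groups(lambda i, j, k: (i, j, k))

ss_T = sum((x - grand) ** 2 for *_, x in cases)
ss_R = sum(n * (m - grand) ** 2 for n, m in R.values())
ss_C = sum(n * (m - grand) ** 2 for n, m in C.values())
ss_RC = sum(n * (m - R[i][1] - C[j][1] + grand) ** 2
            for (i, j), (n, m) in RC.items())
ss_w_obs_RC = sum((x - RC[(i, j)][1]) ** 2 for i, j, k, x in cases)
ss_cells = sum(n * (m - grand) ** 2 for n, m in cells.values())
ss_w = sum((x - cells[(i, j, k)][1]) ** 2 for i, j, k, x in cases)

print(abs(ss_T - (ss_R + ss_C + ss_RC + ss_w_obs_RC)) < 1e-9)   # identity (65)
print(abs(ss_T - (ss_cells + ss_w)) < 1e-9)                      # identity (68)
```

The same helper, with the key changed to (i, k) or (j, k), gives the components of (66) and (67).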




Now suppose that in each of the obs RC, a constant amount (equal to the 
deviation of the oblong mean from the general mean of the entire table) is 
added to or subtracted from each measure in the oblong, thus making each 
oblong mean equal to the general mean. These corrections would reduce to
zero the sums of squares for R, C, and RC in the corrected table, leaving only
the sum of squares for within-obs RC. Thus the sum of squares for total
($ss'_T$) in the corrected table is equal to the sum of squares for within-obs RC
($ss_{w\,\text{obs}\,RC}$) in the original table. That is,

$$ss'_T = ss_{w\,\text{obs}\,RC}. \qquad (69)$$


Now let us regard the once-corrected table as consisting of obs RS. The sum
of squares for total ($ss'_T$) in this table may then be analyzed as follows:

$$ss'_T = ss'_R + ss'_S + ss'_{RS} + ss'_{w\,\text{obs}\,RS}.$$

In the corrected table, however, $ss'_R$ has been made equal to zero by the
corrections in the obs RC. Hence,

$$ss'_T = ss'_S + ss'_{RS} + ss'_{w\,\text{obs}\,RS}. \qquad (70)$$


Now let us apply a correction to the measures within each of the obs RS in 
the corrected table, so as to make the mean of each of these oblongs equal to 
the general mean for the entire table. In this twice-corrected table, the sums 
of squares for S and RS will each reduce to zero, so that all that is left is the 
variation within the RS combinations. That is, the sum of squares for total
($ss''_T$) in the twice-corrected table is the same as $ss'_{w\,\text{obs}\,RS}$, or

$$ss''_T = ss'_{w\,\text{obs}\,RS}. \qquad (71)$$


Let us now regard the twice-corrected table as consisting of obs CS. We may 
then analyze the sum of squares for total in the twice-corrected table as
follows:

$$ss''_T = ss''_C + ss''_S + ss''_{CS} + ss''_{w\,\text{obs}\,CS}.$$

Because of the two series of corrections

$$ss''_C = 0 \quad \text{and} \quad ss''_S = 0,$$

and hence,

$$ss''_T = ss''_{CS} + ss''_{w\,\text{obs}\,CS}. \qquad (72)$$

Now let us apply a third series of corrections, this time to the measures 
within obs CS so as to make the mean of each of these oblongs equal to the 
general mean. This would reduce to zero the sum of squares for CS, so that 
the sum of squares for total in the thrice-corrected table would be given by

$$ss'''_T = ss''_{w\,\text{obs}\,CS}. \qquad (73)$$


In this thrice-corrected table, there would be no differences among the 
means of R-layers, or of C-layers, or of S-layers, or of any oblongs however 
viewed. There would, however, still be some residual differences among the
cell (cubicle) means, so that the sum of squares for total in the thrice-corrected
table could be analyzed into

$$ss'''_T = ss'''_{cells} + ss'''_w. \qquad (74)$$


We may now note that because of the proportionality of cell frequencies,
the first correction (within-obs RC) would have no effect on $ss_S$ and $ss_{RS}$.
That is,

$$ss'_S = ss_S \quad \text{and} \quad ss'_{RS} = ss_{RS}. \qquad (75)$$

The first two corrections similarly would leave

$$ss''_{CS} = ss_{CS}, \qquad (76)$$

and all three corrections together would leave

$$ss'''_w = ss_w. \qquad (77)$$

We will now call the sum of squares for between cells in the thrice-corrected
table the sum of squares for “triple” interaction or for RCS. That is, we will
let

$$ss'''_{cells} = ss_{RCS}. \qquad (78)$$


Now, substituting from (77) and (78) in (74), we get

$$ss'''_T = ss_{RCS} + ss_w. \qquad (79)$$

From (73), (76), and (79) in (72), we get

$$ss''_T = ss_{CS} + ss_{RCS} + ss_w. \qquad (80)$$

From (71), (75), and (80) in (70), we get

$$ss'_T = ss_S + ss_{RS} + ss_{CS} + ss_{RCS} + ss_w \qquad (81)$$

and from (69) and (81) in (65), we get

$$ss_T = ss_R + ss_C + ss_S + ss_{RC} + ss_{RS} + ss_{CS} + ss_{RCS} + ss_w. \qquad (82)$$


All the right hand terms in this expression may be computed from the
original table (see (65), (66), (67), and (68)), but $ss_{RCS}$ may most readily be
obtained as a residual. Thus, all the terms on the right of (82) may be
computed from the original table without actually having to make any of the
arithmetic corrections.
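
As a rough illustration of (82), the sketch below computes the eight components directly from the original measures, obtaining the interactions and the within-cells term as residuals in the manner just described. The function name `three_way_ss`, the nested-list data layout, and the small 2 X 2 X 2 data set are assumptions made for the example only.

```python
from itertools import product


def three_way_ss(x):
    """x[i][j][k] is the list of measures in cell (i, j, k) of an R x C x S table."""
    r, c, s = len(x), len(x[0]), len(x[0][0])
    cells = list(product(range(r), range(c), range(s)))
    N = sum(len(x[i][j][k]) for i, j, k in cells)
    T = sum(sum(x[i][j][k]) for i, j, k in cells)
    correction = T * T / N

    def between(keep):
        # Sum of (group total)^2 / (group n) over the groups defined by the
        # retained subscripts, minus the correction term T^2 / N.
        totals, counts = {}, {}
        for idx in cells:
            key = tuple(idx[d] for d in keep)
            measures = x[idx[0]][idx[1]][idx[2]]
            totals[key] = totals.get(key, 0) + sum(measures)
            counts[key] = counts.get(key, 0) + len(measures)
        return sum(totals[g] ** 2 / counts[g] for g in totals) - correction

    ss = {"R": between((0,)), "C": between((1,)), "S": between((2,))}
    ss["RC"] = between((0, 1)) - ss["R"] - ss["C"]
    ss["RS"] = between((0, 2)) - ss["R"] - ss["S"]
    ss["CS"] = between((1, 2)) - ss["C"] - ss["S"]
    ss_cells = between((0, 1, 2))
    # Triple interaction as a residual from the between-cells sum of squares.
    ss["RCS"] = ss_cells - sum(ss[k] for k in ("R", "C", "S", "RC", "RS", "CS"))
    ss["T"] = sum(v * v for i, j, k in cells for v in x[i][j][k]) - correction
    ss["w"] = ss["T"] - ss_cells
    return ss


# Hypothetical data: two measures in each cell of a 2 x 2 x 2 table.
data = [
    [[[1, 2], [3, 5]], [[2, 4], [6, 7]]],
    [[[0, 3], [4, 4]], [[5, 6], [9, 8]]],
]
result = three_way_ss(data)
for name in ("R", "C", "S", "RC", "RS", "CS", "RCS", "w", "T"):
    print(name, round(result[name], 4))
```

The sketch relies on proportional cell frequencies; with disproportional frequencies the residual terms no longer have the meaning given to them in the text.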

The degrees of freedom for each of the sums of squares computed directly
from the original table are determined as in preceding analyses. The degrees
of freedom for $ss_{RCS}$ may then be obtained as a residual as follows:

$$df_{RCS} = df_T - df_R - df_C - df_S - df_{RC} - df_{RS} - df_{CS} - df_w,$$

or

$$df_{RCS} = (N-1) - (r-1) - (c-1) - (s-1) - (r-1)(c-1) - (r-1)(s-1) - (c-1)(s-1) - (N - rcs),$$

which reduces to

$$df_{RCS} = (r-1)(c-1)(s-1).$$

This analysis is summarized in Table 7 on page 224.

TABLE 7  Analysis of Total Sum of Squares in Three-Way Table into Eight
Components by Successive Applications of the Method of Arithmetic Corrections

[Table columns: Original Table; Once-Corrected Table; Twice-Corrected Table;
Thrice-Corrected Table]


Computational Procedures 


The procedures for computing the various sums of squares in a three-way 
table are essentially the same as those employed in the two-way table. The 
sum of squares for the treatments in the row classification ($ss_R$), for instance,
is obtained just as in a double-entry table by means of the formula

$$ss_R = \sum_{i=1}^{r} \frac{T_{i..}^2}{n_{i..}} - \frac{T^2}{N}$$

in which $T_{i..}$ is the sum of all criterion measures in the ith R-layer
(corresponding to the ith row in the front face). The sum of squares for each of
the other main effects (C and S) is similarly computed.

The sum of squares for any two-factor interaction is obtained by first
computing the sum of squares for between-combinations of those two factors, and
then subtracting the sums of squares for the main effects of those factors. We
will let $ss_{\text{obs}\,RS}$ represent the sum of squares for between-obs RS, or
for between-RS-combinations. The sum of squares for between-combinations of the
R and S factors is then given by

$$ss_{\text{obs}\,RS} = \sum_{k=1}^{s}\sum_{i=1}^{r} \frac{T_{i.k}^2}{n_{i.k}} - \frac{T^2}{N}$$

in which $T_{i.k}$ is the sum of the criterion measures in one of the obs RS. The
sum of squares for the RS interaction is then given by

$$ss_{RS} = ss_{\text{obs}\,RS} - ss_R - ss_S.$$


The “triple” interaction sum of squares ($ss_{RCS}$) is similarly computed by
first computing the sum of squares for between-combinations of all three
factors (between-cells), given by

$$ss_{cells} = \sum_{k=1}^{s}\sum_{j=1}^{c}\sum_{i=1}^{r} \frac{T_{ijk}^2}{n_{ijk}} - \frac{T^2}{N}$$

and then subtracting the sums of squares for all main effects and for all
two-factor interactions.

The formulas for computing the various sums of squares are given in the
following summary table:


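TABLE 8  Computational Formulas for a Three-Classification (R X C X S) Design

Source of Variation   df                      Sum of Squares

Rows (R)              r - 1                   $ss_R = \sum_{i=1}^{r} T_{i..}^2/n_{i..} - T^2/N$
Columns (C)           c - 1                   $ss_C = \sum_{j=1}^{c} T_{.j.}^2/n_{.j.} - T^2/N$
Slices (S)            s - 1                   $ss_S = \sum_{k=1}^{s} T_{..k}^2/n_{..k} - T^2/N$
RC                    (r - 1)(c - 1)          $ss_{RC} = \sum_{j=1}^{c}\sum_{i=1}^{r} T_{ij.}^2/n_{ij.} - T^2/N - (ss_R + ss_C)$
RS                    (r - 1)(s - 1)          $ss_{RS} = \sum_{k=1}^{s}\sum_{i=1}^{r} T_{i.k}^2/n_{i.k} - T^2/N - (ss_R + ss_S)$
CS                    (c - 1)(s - 1)          $ss_{CS} = \sum_{k=1}^{s}\sum_{j=1}^{c} T_{.jk}^2/n_{.jk} - T^2/N - (ss_C + ss_S)$
RCS                   (r - 1)(c - 1)(s - 1)   $ss_{RCS} = \sum_{k=1}^{s}\sum_{j=1}^{c}\sum_{i=1}^{r} T_{ijk}^2/n_{ijk} - T^2/N - (ss_R + ss_C + ss_S + ss_{RC} + ss_{RS} + ss_{CS})$
Within-cells (w)      N - rcs                 $ss_w = \sum_{k=1}^{s}\sum_{j=1}^{c}\sum_{i=1}^{r}\sum X^2 - \sum_{k=1}^{s}\sum_{j=1}^{c}\sum_{i=1}^{r} T_{ijk}^2/n_{ijk}$
Total                 N - 1                   $ss_T = \sum_{k=1}^{s}\sum_{j=1}^{c}\sum_{i=1}^{r}\sum X^2 - T^2/N$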


Illustration of Computational Procedure: A convenient way of organizing the 
data from a three-factor experiment for computational purposes is illustrated 
below for a design involving two levels of factor A, four of B, and three of C,
with varying numbers of subjects in the B categories. This may be denoted a 
2 X 4 X 3 design, to indicate the numbers of levels of the various factors. 


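[Data table for the 2 X 4 X 3 design: criterion measures and cell sums for each
A-B-C combination, with columns B1 to B4 under each level of A, each subdivided
into C1, C2, and C3. Surviving entries: 20, 16, 17, 17; 104, 90, 115, 181, 139, 149.]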




This table indicates that for the four subjects receiving treatment $A_1B_1C_1$
the criterion measures were 9, 7, 7, and 14, with a total ($T_{111}$) of 37, etc. The
first step is to compute the sums for the various treatment groups. A
two-dimensional table should be prepared for each possible pair of factors,
disregarding the other factor. Such tables for the preceding data are presented
below. Thus the entry in the upper left cell of the AC table, 278, is obtained
by adding the four $A_1C_1$ sums, i.e., 37 + 64 + 104 + 73 = 278; the entry in the
lower right cell of the BC table, 157, is obtained by summing the two $B_4C_3$ sums,
i.e., 69 + 88 = 157, etc.


[Two-way tables of sums for AB, AC, and BC. Surviving marginal totals: A, 868
and 1160; C, 703, 638, and 687; B4, 480; grand total, 2028.]


The subsequent calculations may be indicated as follows:

$$ss_T = 9^2 + 7^2 + \cdots + 23^2 + 17^2 - (2028)^2/138 = 33{,}260.0 - 29{,}802.8 = 3457.2$$

$$ss_{cells} = \frac{37^2 + \cdots + 58^2}{4} + \cdots - 29{,}802.8 = 1453.8$$

$$ss_w = 3457.2 - 1453.8 = 2003.4$$

$$ss_A = \frac{868^2 + 1160^2}{69} - 29{,}802.8 = 617.8$$

$$ss_B = \frac{327^2}{24} + \frac{443^2}{36} + \frac{778^2}{48} + \frac{480^2}{30} - 29{,}802.8 = 394.1$$

$$ss_C = \frac{703^2 + 638^2 + 687^2}{46} - 29{,}802.8 = 49.9$$

$$ss_{AB} = \left[\frac{144^2 + 183^2}{12} + \frac{190^2 + 253^2}{18} + \frac{309^2 + 469^2}{24} + \frac{225^2 + 255^2}{15}\right] - 29{,}802.8 - 617.8 - 394.1 = 119.1$$

$$ss_{AC} = \left[\frac{278^2 + 425^2 + 291^2 + 347^2 + 299^2 + 388^2}{23}\right] - 29{,}802.8 - 617.8 - 49.9 = 92.3$$

$$ss_{BC} = \left[\frac{98^2 + 118^2 + 111^2}{8} + \frac{156^2 + 132^2 + 155^2}{12} + \frac{285^2 + 229^2 + 264^2}{16} + \frac{\cdots + 157^2}{10}\right] - 29{,}802.8 - 394.1 - 49.9 = 110.1$$

$$ss_{ABC} = 1453.8 - 617.8 - 394.1 - 49.9 - 119.1 - 92.3 - 110.1 = 70.5$$


(Note that in every case the divisor is the number of measures on which each 
squared term is based. If these n’s are not the same for all terms involved, 
separate divisions must be performed before summation.) 
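
The arithmetic for the A, B, and AB terms can be retraced from the AB sums alone. The sketch below uses the eight AB sums that appear in the calculations (144, 183, 190, 253, 309, 469, 225, 255). The divisors 18, 24, and 15 are taken from the worked example; the divisor 12 for the B1 cells is an inference (four subjects per A-B-C cell times three C levels). The variable and function names are hypothetical.

```python
# Sums of the criterion measures for each A-B combination (all C levels pooled).
ab_sums = {
    ("A1", "B1"): 144, ("A2", "B1"): 183,
    ("A1", "B2"): 190, ("A2", "B2"): 253,
    ("A1", "B3"): 309, ("A2", "B3"): 469,
    ("A1", "B4"): 225, ("A2", "B4"): 255,
}
# Number of measures behind each AB sum; 12 for B1 is inferred, not quoted.
ab_n = {"B1": 12, "B2": 18, "B3": 24, "B4": 15}

T = sum(ab_sums.values())           # grand total
N = 2 * sum(ab_n.values())          # two A levels for every B category
correction = T ** 2 / N             # the term T^2 / N


def marginal_ss(position):
    """Between-levels sum of squares for factor A (position 0) or B (position 1)."""
    totals, counts = {}, {}
    for key, value in ab_sums.items():
        level = key[position]
        totals[level] = totals.get(level, 0) + value
        counts[level] = counts.get(level, 0) + ab_n[key[1]]
    return sum(totals[lv] ** 2 / counts[lv] for lv in totals) - correction


ss_a = marginal_ss(0)
ss_b = marginal_ss(1)
ss_between_ab = sum(v ** 2 / ab_n[k[1]] for k, v in ab_sums.items()) - correction
ss_ab = ss_between_ab - ss_a - ss_b

print(T, N)
print(round(ss_a, 2), round(ss_b, 2), round(ss_ab, 2))
```

Under these assumptions the results agree with the values 617.8, 394.1, and 119.1 of the worked example to within the rounding of intermediate figures.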


Summary Table 

Source df ss ms 

A 1 617.8 617.80 
B 3 394.1 131.36 
C 2 49.9 24.95 
AB 3 119.1 39.70 
AC 2 92.3 46.15 
BC 6 110.1 18.35 
ABC 6 70.5 11.75 
w cells 114 2003.4 17.56 

Total 137 3457.2 


Meaning of Triple Interaction 


The observed triple interaction is measured by the residual variability in 
the cell means after corrections have been applied that make the means of 
all oblongs (in any direction) equal to the general mean. In a table so cor- 
rected, the sums of squares for all main effects and for all two-factor inter- 
actions will be equal to zero, but there will still be differences among the 
cell (cubicle) means. If these differences are too large to be reasonably 
attributed to random Type S fluctuations in the cell means, we say that the 
observed triple interaction is significant. 

To say that there is a triple interaction also means that corresponding 
interaction effects (see page 118) of two of the factors differ in magnitude 
from level to level of the third factor. For instance, if corresponding inter- 
action effects for the various R-C combinations (cells) are not the same for 
all s-layers, then there is a triple interaction in the table. This is also
equivalent to saying that there is a simple interaction between the S factor and
the RC interaction ($ss_{RCS} = ss_{(RC)S}$), or between the R factor and the CS
interaction ($ss_{RCS} = ss_{R(CS)}$), etc. It does not matter in what order the
factors are arranged in the subscript, that is, $ss_{RCS} = ss_{CSR} = ss_{SCR}$, etc.

The meaning of triple interaction is perhaps most readily understood in
the case of a 2 X 2 X s table, such as that diagrammed below. Let $\mu_{11k}$,
$\mu_{12k}$, etc. represent the true means corresponding to $M_{11k}$, $M_{12k}$, etc.,
respectively. If $(\mu_{11k} - \mu_{12k}) - (\mu_{21k} - \mu_{22k}) \neq 0$ for any given
value of k, we would say that there is an interaction between the R and C
factors so far as the corresponding level of S is concerned. If
$(\mu_{11k} - \mu_{12k}) - (\mu_{21k} - \mu_{22k})$ differs for different values of k, that
is, if it differs from s-layer to s-layer, then a triple interaction is present.
To test the hypothesis that there is no triple interaction, one would ascertain
whether the observed variance of $D = (M_{11k} - M_{12k}) - (M_{21k} - M_{22k})$ for
the various values of k is significantly larger than the variance which would
be expected as the result of sampling fluctuations in the cell means. This,
as we shall see later, is what is accomplished by the F-test based on the ratio
$ms_{RCS}/ms_w$.
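
The quantity D can be tabulated layer by layer to see what a triple interaction looks like in a 2 X 2 X s table of cell means. The following is a minimal sketch; the function name `layer_contrasts` and both sets of means are hypothetical.

```python
def layer_contrasts(M):
    """M[i][j][k] holds the cell means of a 2 x 2 x s table.

    Returns D_k = (M_11k - M_12k) - (M_21k - M_22k) for every s-layer k.
    """
    s = len(M[0][0])
    return [
        (M[0][0][k] - M[0][1][k]) - (M[1][0][k] - M[1][1][k]) for k in range(s)
    ]


# Hypothetical means: the same RC interaction in every s-layer.
constant = [[[10, 12, 14], [8, 9, 13]], [[7, 10, 9], [8, 10, 11]]]
# Hypothetical means: the RC interaction changes from layer to layer.
varying = [[[10, 12, 14], [8, 9, 13]], [[7, 10, 9], [8, 12, 6]]]

d_constant = layer_contrasts(constant)
d_varying = layer_contrasts(varying)
print(d_constant)  # identical values: RC interaction but no triple interaction
print(d_varying)   # differing values: a triple interaction in the means
```

In observed data the values of D will differ somewhat from layer to layer through sampling fluctuation alone, which is why the comparison with the within-cells variance is needed.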

From any r X c X s table one can select a number of 2 X 2 X s tables,
each of which consists of obs RC ending in the corner cells of a rectangle
drawn on the RC face (see the accompanying drawing). If, in any 2 X 2 X s
table, there is a triple interaction as just defined, then there is a triple
interaction in the entire table. The F-ratio $ms_{RCS}/ms_w$ for the entire table
tests the hypothesis that there is no triple interaction in any 2 X 2 X s or
2 X 2 X r or 2 X 2 X c table taken from the entire table. [This F-test,
however, is not very sensitive to a triple interaction affecting only a small
part of the whole table (see page 140).]

It should be apparent that the triple interaction may be much more pro- 
nounced in one of the component 2 X 2 X s tables than in others, or that there 
may be no triple interaction in some such tables and some triple interaction 
in others. It is possible, in other words, that the triple interaction for the 
whole table is not homogeneous. Heterogeneity of triple interaction, how- 
ever, may be quite independent of heterogeneity of any two-factor inter- 
action, either for the table as a whole or for any layer. 

In the twice-corrected table of page 222, in which all obs RC and obs RS 
means equal the general mean, the triple interaction may also be regarded as 
measured by the differences among cell means within obs CS. If corrections 
have made all obs RS and obs CS means equal to the general mean, the triple 
interaction would similarly be measured by differences among cell means 
within obs RC. It is also measured by differences among cell means within 
obs RS after differences among obs RC and obs CS have been eliminated. 

Each of the interactions is independent of each of the others. It is quite
possible, for instance, that there is no RC interaction for the entire table
even though there is a triple interaction. This would mean that the differing
RC interactions in the various s-layers tend to “average out” to zero. That
is, in a table in which differences among obs RS and obs CS have been
eliminated, the true means of obs RC might all have the same value, even though
there are real differences among cells within obs RC. However, if the true RC
interaction were the same for all s-layers, the RCS interaction would be
zero, even though the RC interaction is pronounced. In a table in which
corresponding cell frequencies are proportional from layer to layer (in any
direction), each component of the total sum of squares (82) is independent
of all others.


Applications of Three-Dimensional Designs 


As noted earlier (page 220), there is a variety of experimental designs 
for which the data may be presented in a three-way table, and in which the 
total sum of squares may be analyzed in the manner just considered. We will 
now consider the various tests of significance and their interpretation for 
each of the principal applications. 


Random Replications of a Two-Factor Experiment (A x B x R Designs) 


We will consider first the design which involves random replications of a 
two-factor experiment. We will let A and B represent the experimental 
factors and R represent the random replications of the basic A X B experiment. 
Each of the R-layers in the three-way table will then represent an independent 
two-factor experiment of the type considered in Chapter 9, while the r-layers 
or replications represented in the entire experiment will be regarded as a 
simple random sample from a hypothetical population of such layers or 
replications. The replications may be “simple” replications, in the sense 
that the subjects involved in each replication are a random sample from the 
same population. On the other hand, the entire population may consist of a 
very large number of subpopulations. A relatively small number (r) of these 
may have been selected at random for the purposes of the experiment, and from 
each of these selected subpopulations a number of subjects may have been 
drawn at random. 

Testing the ABR Interaction: In general, in experiments involving random 
replications of a two-factor experiment, there is relatively little interest 
in a test of the triple interaction. In many such instances it may be taken 
for granted that either experimental errors (Type G) which vary from repli- 
cation to replication, or differences among the subpopulations sampled, or 
both, will result in a non-chance triple interaction. In general, also, the
significance or non-significance of the ABR interaction will have little bearing
on the rest of the analysis. However, the significance of the triple interaction
may be tested by means of

$$F = ms_{ABR}/ms_w, \quad df = (a-1)(b-1)(r-1)/(N - abr). \qquad (83)$$


The conditions under which this mean square ratio is distributed as F are 
as follows: 


1) The measures in each cell of the table are a simple random sample 
from a corresponding subpopulation. 


2) The distribution of criterion measures for each of these subpopulations 
is normal. 


3) Each of these distributions has the same variance. 


4) The triple interaction (ABR) is zero for the entire population. 


Since proportionality of cell frequencies is required in the analysis of the 
total sum of squares into its components, this requirement will not be repeated 
for each of the tests here considered, but is always actually one of the under- 
lying requirements. 

To prove that under these conditions $ms_{ABR}/ms_w$ is distributed as F, we
may note first that in the thrice-corrected table on page 223, the total sum
of squares may be analyzed into

$$ss'''_T = ss'''_{cells} + ss'''_w.$$

Accordingly, under Conditions 1, 2, and 3, the design in the thrice-corrected
table may be regarded as a simple-randomized design. We know that in any
simple-randomized design, $ms_{groups}/ms_{w\,groups}$ is distributed as F. In this
case, $ms_{groups} = ms'''_{cells}$ and $ms_{w\,groups} = ms'''_w$. We know also that
$ms'''_{cells} = ms_{ABR}$ and that $ms'''_w = ms_w$. Accordingly,

$$F = ms_{groups}/ms_{w\,groups} = ms_{ABR}/ms_w.$$


In deciding to what extent Conditions 1 to 3 are satisfied in a particular 
experiment, the important considerations are essentially the same as those 
discussed in Chapter 3, pages 72-90, and again in subsequent chapters. 

Testing the AB Interaction: As noted in the preceding section, there is 
usually little interest in the triple interaction in an A x B X R experiment. 
Usually the existence of a triple interaction is taken for granted, and the 
mean square for this triple interaction is used as the error term in testing 
the AB effect. The use of the ABR interaction as an error term, however, 
requires either that the number of observations be the same for all cells in 
the table or that the analysis be based on the unweighted cell means. If the 
number of cases per cell is constant, the ratios used in the tests of significance 
are the same as when the analysis is based on the individual measures, since 
any sum of squares based on the individual measures is n times the
corresponding sum of squares based on the unweighted cell means. Since the
proofs may thereby be simplified, the discussion from this point on will be 
concerned primarily with the case in which the analysis is based on the un- 
weighted cell means. As in preceding chapters, we will employ an asterisk 
to indicate that a sum of squares or a mean square is based on the unweighted 
cell means. 
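
The statement that, with a constant n per cell, a sum of squares based on the individual measures is n times the corresponding sum of squares based on the unweighted cell means can be confirmed numerically. The sketch below does so for the A main effect only; the data, the helper names `ss_between` and `group_by_a`, and the choice of n = 3 are all hypothetical.

```python
def ss_between(groups):
    """Sum over groups of (group total)^2 / (group size), minus T^2 / N."""
    total = sum(sum(g) for g in groups)
    count = sum(len(g) for g in groups)
    return sum(sum(g) ** 2 / len(g) for g in groups) - total ** 2 / count


n = 3  # constant number of measures per cell
cells = {  # hypothetical measures keyed by (a, b, r)
    (0, 0, 0): [5, 7, 6], (0, 0, 1): [4, 6, 8],
    (0, 1, 0): [9, 8, 10], (0, 1, 1): [7, 9, 8],
    (1, 0, 0): [3, 5, 4], (1, 0, 1): [6, 5, 7],
    (1, 1, 0): [2, 4, 3], (1, 1, 1): [5, 3, 4],
}
cell_means = {key: sum(v) / len(v) for key, v in cells.items()}


def group_by_a(table, flatten):
    """Collect the entries of `table` by level of A."""
    groups = {}
    for (a, _, _), value in table.items():
        groups.setdefault(a, []).extend(value if flatten else [value])
    return list(groups.values())


ss_a_individual = ss_between(group_by_a(cells, flatten=True))
ss_a_means = ss_between(group_by_a(cell_means, flatten=False))
print(ss_a_individual, n * ss_a_means)
```

Because every term is multiplied by the same n, any ratio of two mean squares is unchanged, which is why the two forms of analysis give the same F when n is constant.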

The hypothesis that there is no interaction of A and B in the population 
from which the particular replications involved in the experiment are a 
random sample may be tested by 


$$F = ms^*_{AB}/ms^*_{ABR}, \quad df = (a-1)(b-1)/(a-1)(b-1)(r-1). \qquad (84)$$


The conditions under which this mean square ratio is distributed as F are 
as follows: 


1) The replications represented in the experiment are an independent 
simple random sample from a hypothetical population of such replica- 
tions. 


2) The triple interaction effects are normally and independently distributed 
in the population for each treatment-combination. 


3) The triple interaction in the population is homogeneous (has the same 
population variance for all 2 X 2 X r tables in the entire table). 


4) The AB interaction for the entire population is zero. 


It should be noted that if the number of cases per cell is variable, even
though these n’s are in the same proportion from row to row within each
replication, $ms_{AB}/ms_{ABR}$ is not distributed as F.

To prove that $ms^*_{AB}/ms^*_{ABR}$ is distributed as F, let us suppose that in
the three-way table corrections have been applied within obs AR and obs BR
independently, so as to make their means equal to the general mean. Let
us regard this twice-corrected table as consisting of obs AB. Each of these
obs AB is divided into r cells, corresponding to the r replications. In this
twice-corrected table, there will be no within-cells component of the total
sum of squares, since the analysis is based on the cell means. Accordingly,
the total sum of squares in the twice-corrected table may be analyzed (see
page 222) into

$$ss^{*\prime\prime}_T = ss^{*\prime\prime}_{\text{obs}\,AB} + ss^{*\prime\prime}_{w\,\text{obs}\,AB} \qquad (85)$$

in which $ss^{*\prime\prime}_{w\,\text{obs}\,AB} = ss^*_{ABR}$, since $ss^*_w = 0$.

From (85) it is apparent that the design in the twice-corrected table may
be regarded as a simple-randomized design, in which
$ms^{*\prime\prime}_{groups}/ms^{*\prime\prime}_{w\,groups}$ is distributed as F. In this case,
$ss^{*\prime\prime}_{groups} = ss^{*\prime\prime}_{\text{obs}\,AB} = ss^*_{AB}$ (see Table 7, page 224)
and $ss^{*\prime\prime}_{w\,groups} = ss^{*\prime\prime}_{w\,\text{obs}\,AB} = ss^*_{ABR}$. Hence,
$ms^{*\prime\prime}_{groups}/ms^{*\prime\prime}_{w\,groups} = ms^*_{AB}/ms^*_{ABR}$.
Accordingly we may write

$$F = ms^{*\prime\prime}_{groups}/ms^{*\prime\prime}_{w\,groups} = ms^*_{AB}/ms^*_{ABR}.$$


If ABR has already been shown to be significant, or if on a priori considera- 
tions we may take an ABR interaction for granted, we know that there is 
an AB interaction in individual replications, but that it varies from replica- 
tion to replication. However, it is conceivable that these AB interactions 
cancel one another so far as the whole population is concerned, or that there 
is no AB interaction for the population as a whole. If $F = ms^*_{AB}/ms^*_{ABR}$
proves non-significant, we might assume that there is no AB interaction for 
the population as a whole, unless we are prevented from doing so by other 
(a priori) considerations. 

If ABR is due to random Type S and Type G errors only, as it presumably 
would be in the case of “simple” replications in which Type G errors 
have been randomized for each replication independently, a significant 
$F = ms^*_{AB}/ms^*_{ABR}$ would indicate that there is an intrinsic AB interaction.
This is the situation referred to at the end of Chapter 9 (page 216). If the
replications are not “simple,” and if ABR is in part intrinsic, one may still
more surely infer from a significant $F = ms^*_{AB}/ms^*_{ABR}$ that there is an intrinsic
AB interaction. A very important feature of this design, then, is that the 
test of significance enables one to test the hypothesis that there is an in- 
trinsic AB interaction — a test which could not be made in a simple A X B 
experiment. 

It should be noted that the test $F = ms^*_{AB}/ms^*_{ABR}$ is valid whether or not
ABR is significant. (This is another reason why there is usually little in- 
terest in a test of the ABR interaction.) This test of AB must always be 
used if ABR is significant, but if the observed ABR is non-significant, and 
one may assume that there is no intrinsic or extrinsic triple interaction in 
the population, the AB interaction may be tested by 



$$F = \frac{ms_{AB}}{(ss_{ABR} + ss_w)/(df_{ABR} + df_w)}, \quad df = (a-1)(b-1)/[(a-1)(b-1)(r-1) + N - abr]. \qquad (86)$$

The conditions under which this mean square ratio is distributed as F are
as follows:

1) The measures in each cell of the table are an independent simple random
sample from a corresponding subpopulation.

2) The distribution of criterion measures for each of these subpopulations
is normal.

3) Each of these distributions has the same variance.

4) The ABR interaction for the entire population is zero.

5) The AB interaction for the entire population is zero.
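
The pooled error term of (86) amounts to adding sums of squares, adding degrees of freedom, and dividing. A minimal sketch of that arithmetic follows; the helper name `pooled_mean_square` and every number in the example are hypothetical (they correspond to a = 2, b = 4, r = 3, and three cases per cell).

```python
def pooled_mean_square(components):
    """components: (sum of squares, degrees of freedom) pairs to be pooled.

    Returns the pooled mean square and its degrees of freedom.
    """
    components = list(components)
    ss = sum(c[0] for c in components)
    df = sum(c[1] for c in components)
    return ss / df, df


# Hypothetical values for an A x B x R experiment.
ss_abr, df_abr = 60.0, 6     # (a-1)(b-1)(r-1) = 1 * 3 * 2
ss_w, df_w = 480.0, 48       # N - abr = 72 - 24
ms_ab, df_ab = 45.0, 3       # (a-1)(b-1) = 1 * 3

ms_error, df_error = pooled_mean_square([(ss_abr, df_abr), (ss_w, df_w)])
f_ratio = ms_ab / ms_error
print(f_ratio, (df_ab, df_error))
```

The same helper covers any other pooling of sums of squares, provided the assumptions that justify the pooling are met.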




It should be noted that in this case it is not necessary that n be constant, 
although proportionality of the n's is necessary for computational purposes. 
It may be noted also that the replications need not be regarded as random 
replications. This latter feature will be of special consequence in later designs. 

To prove that under these conditions the mean square ratio in (86) is
distributed as F, let us suppose that for each replication independently
arithmetic corrections have been applied within rows and columns so as to
make all row means and all column means equal to the general mean for that
replication. The only remaining variations within each replication are then
those due to between-cells and to within-cells. Each replication thus cor- 
rected may therefore be viewed as a simple-randomized design, with the ab 
“treatments” corresponding to the ab treatment-combinations. The cor- 
rected data for the complete experiment may therefore be presented in a two- 
way table, with ab columns corresponding to treatment-combinations and r 
rows corresponding to replications. This corrected two-way table may then 
be regarded as corresponding to a treatments X random replications design 
of the type considered in Chapter 8. 

We will let $ss'_R$, $ss'_C$, $ss'_{RC}$, and $ss'_w$ represent the components of the
total sum of squares in an analysis of the corrected data in this two-way table.
It should now be apparent that $ms'_C$ in the corrected two-way table is the
same as $ms_{AB}$ in the original table, that $ms'_R = ms_R$, that $ms'_{RC} = ms_{ABR}$,
and that $ms'_w = ms_w$. Accordingly, if there is no ABR interaction in the
original table, then neither will there be any RC interaction in the corrected
two-way table. Therefore, by the reasoning underlying (54) on page 196, the
significance of the columns effect may be tested by

$$F = \frac{ms'_C}{(ss'_{RC} + ss'_w)/(df'_{RC} + df'_w)} = \frac{ms_{AB}}{(ss_{ABR} + ss_w)/(df_{ABR} + df_w)},$$

which is what we set out to prove.

Evaluating the AR and BR Interactions: If the observed ABR interaction
is significant and there is presumably an intrinsic ABR interaction, neither
$ms_{AR}/ms_{ABR}$ nor $ms_{BR}/ms_{ABR}$ is distributed as the ordinary F with
$(a-1)(r-1)/(a-1)(b-1)(r-1)$ degrees of freedom, since neither A nor B is a
random effect. However, we may use $F = ms_{AR}/ms_w$ to test the hypothesis
that in a certain hypothetical population the AR interaction is zero (and
similarly for the BR interaction). This hypothetical population is first of all
restricted to the subpopulations corresponding to the particular replications
involved in the experiment, the numbers in the various subpopulations being
proportional to the corresponding numbers in the replications of the
experiment. In this hypothetical population, some of the members have received
Treatment $B_1$, some Treatment $B_2$, some Treatment $B_3$, etc., the members
receiving each B-treatment being in the same proportion as in the experimental
samples. Since this hypothetical population is nearly always an
artificial population unlike any real population in which the experimenter is
interested, there is ordinarily little interest in any test of the significance of
the AR and BR interactions.

Frequently, in experiments of this kind, the mean squares for AR, BR, 
and ABR will not differ significantly, and the assumption will be plausible 
on a priori grounds that these interactions are all of the same strength. 
In that case, a pooled error term may be obtained by adding the sums of 
squares for AR, BR, and ABR, and dividing by the sum of their degrees of 
freedom. This pooled error term may then be used to test the significance
of the AB interaction. Proof that under these conditions the ratio of $ms_{AB}$
to the pooled error mean square is distributed as F is similar to the preceding
and need not be presented here.

Testing the Main Effects of A and B: If the AB interaction has been proved 
non-significant and if other considerations also permit us to assume that 
there is no AB interaction in the population, we will then be interested 
in the main effects of A and B and wish to test their significance. The main 
effect of either A or B can of course be tested whether or not the AB inter- 
action is significant, although if the AB interaction is significant we would 
ordinarily be interested only in the simple effects of A and B (see page 218).

Whether or not the AB interaction is significant, the main effects of A and B 
may be tested by 

$$F = ms^*_A/ms^*_{AR} \qquad (87)$$

and

$$F = ms^*_B/ms^*_{BR}.$$


The conditions under which $ms^*_A/ms^*_{AR}$ is distributed as F are as follows:


1) The replications represented in the experiment are a simple random 
sample from a hypothetical population of such replications. 


2) The AR interaction effects are normally and independently distributed 
in the population for each A-treatment. 


3) The AR interaction effects in the treatment populations are homogeneous 
in variance. 


4) In the treatment populations the unweighted mean of the subpopulation 
means is the same for each treatment. 


The conditions under which $ms^*_B/ms^*_{BR}$ is distributed as F are exactly
similar. (If the number of cases differs for the various A-R combinations,
$F = ms_A/ms_{AR}$ is not distributed as F, and similarly for $ms_B/ms_{BR}$.)

To prove that under these conditions $ms^*_A/ms^*_{AR}$ is distributed as F, let
us assume that arithmetic corrections have been applied to the original
data so as to eliminate $ms^*_B$, $ms^*_{AB}$, and $ms^*_{ABR}$. (To eliminate
$ms^*_{ABR}$, for example, one would apply a correction within each A-B-R
combination; the correction would be equal to the ABR interaction effect for
that combination, and there would then be no interaction effect for the
corrected measures.) These corrections, of course, would leave $ms^*_A$ and
$ms^*_{AR}$ unaffected. In any one replication, there would then be only chance
differences among the distributions of criterion measures for the various AB
combinations in each A-category. Accordingly, in each replication all the
observations for Treatment $A_1$ could be regarded as a simple random sample,
and similarly for each of the other A treatments. That is, so far as the
corrected measures are concerned, each replication would constitute a
simple-randomized experiment. Accordingly, so far as the corrected measures are
concerned, the entire experiment constitutes a random replications design, in
which, under the hypothesis of equal treatment means, $ms^*_A/ms^*_{AR}$ is
distributed as F.

The proof that $ms^*_B/ms^*_{BR}$ is distributed as F is exactly similar to the
preceding.

The differences in the interpretation of the main effect of A when the 
observed AB interaction is significant and presumably partly intrinsic, and 
when it is non-significant and there is presumably no AB interaction in the 
population, are the same as in the case of the simpler A x B design (see 
pages 209-211). 

If the analysis is based on the individual measures rather than on the
unweighted cell means, if all the observed interactions prove non-significant
when tested against $ms_w$, and if there is presumably no interaction of any
kind (either intrinsic or extrinsic) in the population, the main effects of
A and B may be tested against an error term obtained as follows:

$$ms_{error} = (ss_{AB} + ss_{AR} + ss_{BR} + ss_{ABR} + ss_w)/(df_{AB} + df_{AR} + df_{BR} + df_{ABR} + df_w).$$

In this case, all the interactions are presumably due to random Type S errors;
hence it is legitimate to pool all these sums of squares to get a more stable
error term. The main effects of A and B would then be tested by
$F = ms_A/ms_{error}$ and $F = ms_B/ms_{error}$, the degrees of freedom for the error
term being $N - a - b - r + 2$. There will obviously be very few occasions on
which this test of significance may be used.

Testing the Simple Effects of A or B: If msi; is significant and an interac- 
tion presumably exists, neither ms4/msig nor ms}/ms%p is distributed as F 
(see page 214). In this case, these mean square ratios may be evaluated as 
in a two-factor design (see pages 214-215), but the safest procedure is to 
test the effect of the A factor for each level or category of B separately, or 
the effect of the B factor for each level of A separately, and to make no 
attempt to generalize about A differences for all B categories or about B 
differences for all A categories. To test the simple effect of the A factor 
for any given B category, one would use 


$$F = ms_A^{*}/ms_{AR}^{*}, \quad df = (a - 1)/(a - 1)(r - 1) \tag{88}$$


computed from the data for the particular B category only. In other words, 
one would regard the data for the selected B category as taken from a simple 
random replications design, using the test of the treatments effect already 
proved valid for that design (see page 191). 
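
A minimal sketch of this test follows, assuming the data for the selected B category have been reduced to one cell mean per treatment per replication. The function name and the numbers are hypothetical. Because both mean squares are computed from the same table of means, any common scale factor cancels in the ratio.

```python
def treatments_by_replications_f(table):
    """F for treatments in a random replications layout.

    table[i][k] is the cell mean for treatment i in replication k,
    taken from the data for one B category only.
    """
    a, r = len(table), len(table[0])
    grand_total = sum(sum(row) for row in table)
    ct = grand_total ** 2 / (a * r)  # correction term

    ss_total = sum(v ** 2 for row in table for v in row) - ct
    ss_a = sum(sum(row) ** 2 for row in table) / r - ct
    ss_r = sum(sum(col) ** 2 for col in zip(*table)) / a - ct
    ss_ar = ss_total - ss_a - ss_r  # treatments x replications interaction

    df_a, df_ar = a - 1, (a - 1) * (r - 1)
    f_ratio = (ss_a / df_a) / (ss_ar / df_ar)
    return f_ratio, df_a, df_ar


# Hypothetical cell means: 3 treatments (rows) by 3 replications (columns).
means_for_one_b_category = [
    [4.0, 5.0, 6.0],
    [6.0, 8.0, 7.0],
    [8.0, 9.0, 10.0],
]
f_ratio, df_a, df_ar = treatments_by_replications_f(means_for_one_b_category)
```

The returned F would be referred to the F table with (a - 1) and (a - 1)(r - 1) degrees of freedom, as in (88).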


It should be noted that what we are now calling a “simple” effect of A 
for a given level of B is also a “main” effect of A with reference to R for 
the given level of B. We could, if we wished, also compute the “simple” 
effect of A for a given replication for all levels of B, or for a given replica-
tion within a given level of B. Thus there are three kinds of simple effects
of A in this three-dimensional design: The average effect of A at a given 
level of B for all levels of R; the average effect of A in a given replication for 
all levels of B; and the effect of A for a given BR combination — that is, for 
a given replication within a given category or level of B. In designs involving 
more than three dimensions, there is a still larger number of “simple” effects 
for any given factor. To avoid confusion, it will be best henceforth to specify 
exactly what effect is intended in each case. In this design, for instance, it 
would be better to speak of the "average effect of A at a given level of B
for all replications” instead of “the simple effect of A for a given level of B" — 
noting that “average” effect may mean a weighted average if the number of 
cases in the various treatment combinations differs from replication to 
replication. 

The average effect of the B factor, in any given A category, for all replica- 
tions is similarly tested by 

$$F = ms_B^{*}/ms_{BR}^{*}, \quad df = (b - 1)/(b - 1)(r - 1)$$

computed from the data for the particular A category only.

On the assumption of homogeneous AR interaction for all B categories,
one could use $ms_{AR}^{*}$ computed for the whole table as the error term for testing
$ms_A^{*}$; and similarly for $ms_B^{*}$.

If $n_{ij} = n$ is constant, the corresponding F-ratio computed on the basis
of the individual measures may be used.


Treatments x Treatments x Subjects (A x B x S) Designs 


Sometimes the various treatment-combinations in a two-factor experiment
are such that all may be administered in succession to each subject, without
introducing any serious order or sequence effects. In this case, a very
efficient experiment can be performed, using a random sample of subjects from
the population concerning which the inferences are to be drawn. The criterion
measures can then be entered in a three-way table, in which we will let A and 
B represent the treatment classifications, S the subject classification, and s 
the number of subjects. 

Testing the AB Interaction: The significance of the observed AB interaction
is tested by means of


$$F = ms_{AB}/ms_{ABS}, \quad df = (a - 1)(b - 1)/(a - 1)(b - 1)(s - 1).$$


The conditions under which this ratio is distributed as F (assuming only 
one criterion measure in each cell of the three-way table) are as follows: 




1) The subjects represented in the experiment are a simple random sample 
from a specified population of subjects. 


2) The ABS interaction effects in the population are normally and inde- 
pendently distributed for each AB combination. 


3) The ABS interaction in the population is homogeneous. 


4) The AB interaction in the population is zero. 


Proof that under these conditions this mean square ratio is distributed 
as F is exactly similar to the proof of (84) on page 232. 

The important considerations in the interpretation of a significant
$F = ms_{AB}/ms_{ABS}$ are closely similar to those involved in the interpretation
of a significant $F = ms_A/ms_{AS}$ in the simpler A X S design (see pages
157-160).

Testing the Main Effects of A and B: We would ordinarily be interested in
testing the main effect of A (or B) only if we could assume that there is no
AB interaction. Whether or not the AB interaction is significant, the
significance of the main effect of A is tested by


$$F = ms_A/ms_{AS}, \quad df = (a - 1)/(a - 1)(s - 1).$$


The conditions under which this mean square ratio is distributed as F
(assuming one observation per cell) are as follows:


1) The subjects involved in the experiment are a simple random sample
from a specified population of subjects.


2) The AS interaction effects in the population are normally and inde- 
pendently distributed for each A treatment. 


3) The AS interaction in the population is homogeneous. 

4) The population means for the various A treatments are identical. 

The important considerations in the interpretation of a significant
F are closely similar to those discussed on pages 159-160.

The proof that this mean square ratio is distributed as F is closely similar
to the proof of (87) on pages 235-236.

The main effect of B, of course, is similarly tested by


$$F = ms_B/ms_{BS}.$$
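
The three tests for the A X B X S design (AB against ABS, A against AS, B against BS) can be sketched as follows, assuming one criterion measure per cell. The function name `abs_anova` and the small data set are hypothetical, introduced only to make the arithmetic concrete.

```python
from itertools import product


def abs_anova(x):
    """Analysis of an A x B x S table with one score per cell.

    x[i][j][k] is the score of subject k under A level i and B level j.
    Returns dictionaries of ss, df, ms, and the three F ratios.
    """
    a, b, s = len(x), len(x[0]), len(x[0][0])
    cells = list(product(range(a), range(b), range(s)))
    ct = sum(x[i][j][k] for i, j, k in cells) ** 2 / (a * b * s)

    def ss_from_totals(key, n_per_total):
        """Sum of squares among totals formed by grouping cells on key."""
        totals = {}
        for i, j, k in cells:
            group = key(i, j, k)
            totals[group] = totals.get(group, 0.0) + x[i][j][k]
        return sum(t ** 2 for t in totals.values()) / n_per_total - ct

    ss = {
        "A": ss_from_totals(lambda i, j, k: i, b * s),
        "B": ss_from_totals(lambda i, j, k: j, a * s),
        "S": ss_from_totals(lambda i, j, k: k, a * b),
    }
    ss["AB"] = ss_from_totals(lambda i, j, k: (i, j), s) - ss["A"] - ss["B"]
    ss["AS"] = ss_from_totals(lambda i, j, k: (i, k), b) - ss["A"] - ss["S"]
    ss["BS"] = ss_from_totals(lambda i, j, k: (j, k), a) - ss["B"] - ss["S"]
    ss_total = ss_from_totals(lambda i, j, k: (i, j, k), 1)
    ss["ABS"] = ss_total - sum(ss.values())  # residual triple interaction

    df = {
        "A": a - 1, "B": b - 1, "S": s - 1,
        "AB": (a - 1) * (b - 1),
        "AS": (a - 1) * (s - 1),
        "BS": (b - 1) * (s - 1),
        "ABS": (a - 1) * (b - 1) * (s - 1),
    }
    ms = {name: ss[name] / df[name] for name in ss}
    f = {
        "AB": ms["AB"] / ms["ABS"],
        "A": ms["A"] / ms["AS"],
        "B": ms["B"] / ms["BS"],
    }
    return ss, df, ms, f


# Hypothetical scores: 2 A levels x 2 B levels x 3 subjects.
scores = [
    [[3, 5, 4], [4, 7, 6]],     # A1 under B1, B2
    [[6, 6, 8], [9, 10, 13]],   # A2 under B1, B2
]
ss, df, ms, f = abs_anova(scores)
```

Each F would then be referred to the F table with the degrees of freedom of its own numerator and denominator, for example `df["A"]` and `df["AS"]` for the main effect of A.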


Random Replications of Treatments x Levels Designs (A x L x R Designs)


We have noted previously that the analysis and interpretation of a two-
factor (A X B) design (Chapter 9) and of a treatments X levels (A X L)
design (Chapter 5) are very much alike. In the treatments X levels design,
the primary interest is in the main effect of only the experimental factor
(the treatment classification); in the two-factor design the interest is in 
the main effects of both factors as well as in the interaction effect. How- 
ever, the tests of significance and their interpretation are essentially the 
same for both designs, except that the population for which the experimental 
sample is representative with respect to levels of a control variable may be 
a real population, or, if hypothetical, may be closely like some real population 
in which the experimenter is interested. In the factorial design, however,
the population from which the experimental sample may be regarded as 
representative with respect to the categories of either factor is an artificial 
population of little practical interest. Hence, main effects in a two-factor 
design are usually of interest only if the observed interaction between the 
factors is non-significant and if there is presumably no interaction in the 
population. 

The situation is very much the same with reference to random replica-
tions of two-factor (A X B X R) designs and random replications of treat-
ments X levels (A X L X R) designs. The tests of significance and their in- 
terpretation are essentially the same in both instances. However, for the 
reasons given, the main effect of treatments (A) in random replications of a 
treatments X levels design will be of interest even though the AL interaction 
is significant, whereas the main effect of one of the treatments (A or B) in 
random replications of a two-factor design will be of little interest if the 
AB interaction is significant. 

It is important to note that in random replication of an A X L design, 
the levels must be the same for all replications if the mean squares for L, AL, 
LR, and ALR interactions are to have any clear and useful meaning. This 
means that if these effects are to be tested, use may not be made of the method 
of constituting levels by “counting off” na subjects at a time (see pages
129-132). Instead, the method of representative sampling from a real
population (pages 128-129) must be used. However, if the levels in each replication
are constituted by the “counting off” method, in which case the “levels”
will not be the same for all replications, the significance of the treatment effect
may still be tested by $F = ms_A^{*}/ms_{AR}^{*}$, assuming that the AR interaction
effects are normally and independently distributed for each treatment, and 
assuming also a homogeneous AR interaction. 


Two-Factor Experiments with Matched Groups (A x B x L Designs) 


We may now consider the case in which A and B again represent experi- 
mental factors, but in which L represents a control variable introduced to 
achieve higher precision in the evaluation of A and B. The levels are con- 
stituted in the manner described in Chapter 5, pages 129-132, providing 
for at least two observations in each cell, and a two-factor (A X B) experi-
ment is then performed at each of these levels. That is, within each level
the subjects are randomly assigned to the various A-B combinations (cells).
Usually the number assigned to each cell is the same, but this is not neces- 
sary so long as the corresponding numbers are proportional from row to row 
or from column to column within each level. 
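
The proportionality requirement can be checked mechanically for any one level. The sketch below is an illustration only; the function name and the cell frequencies are hypothetical. It uses the fact that frequencies are proportional from row to row (and equivalently from column to column) exactly when each cell frequency times the level total equals its row total times its column total.

```python
def has_proportional_frequencies(counts):
    """True if counts[i][j] * N == (row total i) * (column total j) for every cell."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    n_total = sum(row_totals)
    return all(
        counts[i][j] * n_total == row_totals[i] * col_totals[j]
        for i in range(len(counts))
        for j in range(len(counts[0]))
    )


# Hypothetical cell frequencies for one level (rows = A, columns = B).
proportional_level = [[2, 4], [3, 6]]
disproportionate_level = [[2, 4], [3, 5]]
```

In use, the check would be applied to each level separately, since the text requires proportionality only within each level.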

Testing the ABL Interaction: With this design, the significance of the 
triple interaction is tested in exactly the same fashion as was the ABR inter- 
action in the A X B X R design (see pages 230-231) and under exactly the same 
assumptions. A significant $F = ms_{ABL}/ms_w$ indicates that differences between
corresponding AB interaction effects are not the same from level to level. 
Whether this variation is intrinsic or due to error only is a matter for judg- 
ment, based on considerations similar to those discussed on pages 123-127. 

Testing the AB Interaction: The hypothesis that there is no AB interaction 
for the population as a whole may be tested by means of 


$$F = ms_{AB}/ms_w, \quad df = (a - 1)(b - 1)/(N - abl). \tag{89}$$


The conditions under which this mean square ratio is distributed as F 
are as follows:¹


1) The subjects receiving each A-B combination were originally an in- 
dependent representative sample from the same population (representa- 
tive with reference to the L levels). 


2) The distribution of criterion measures is normal for the subpopulation 
corresponding to each A-B combination in each level (to each cell in 
the three-way table). 


3) Each of these distributions has the same variance.


4) The AB interaction in the population is zero.


1 To prove that, under these conditions, $ms_{AB}/ms_w$ is distributed as F, let us suppose
that, within the original three-way table, constant corrections are applied within each
level so as to eliminate the A and B effects for that level. These corrections would of
course also eliminate the A and B effects, as well as the AL and BL interactions, for the
table as a whole. That is, for the corrected measures, $ms_A = ms_B = ms_{AL} = ms_{BL} = 0$.
The corrected measures could then be entered in a two-way table with ab columns cor-
responding to the various A-B treatment combinations and with l rows corresponding
to the various levels. The relations between the mean squares in the original three-way
table and in the corrected two-way table would then be as follows:

$ms_{rows \times columns} = ms_{ABL}$,
$ms_{columns} = ms_{AB}$,
$ms_{rows} = ms_L$,
$ms_{within} = ms_w$.

This table might thus be regarded as corresponding to a treatments x levels design, in
which the “treatments” are the various A-B combinations. Accordingly, in the cor-
rected table, the significance of the “treatments” (columns) effect is tested by

$F = ms_{columns}/ms_{within}$,

which, according to the equalities listed above, is the same as

$F = ms_{AB}/ms_w$.


Note that Condition 1 provides for proportionality of cell frequencies, 
but proportionality is in any case necessary for computational purposes. 

The population referred to in Condition 4 is that from which the total 
experimental sample is representative with reference to the L-levels. This 
is usually a meaningful population and one which, if not real, is at least 
closely similar to a real population in which the experimenter is interested 
(see page 131). Any inferences drawn about the hypothetical population 
may therefore usually be safely extended to the corresponding real popula-
tion.

A significant $F = ms_{AB}/ms_w$ indicates that the observed interaction may
not be attributed to random Type S errors only, but it does not rule out the 
possibility that the significance of the observed AB interaction is due to 
experimental (Type G) errors only, with no intrinsic AB interaction in the 
population as a whole. How a significant F is to be interpreted in this re- 
spect, then, depends on the interpreter’s judgment and on the extent to which 
Type G errors have been controlled in the experiment. 

A non-significant $F = ms_{AB}/ms_w$ permits one to retain the hypothesis that
there is no AB interaction in the population as a whole. A non-significant F, 
however, does not rule out the possibility that there is an AB interaction at 
any particular level of L. This possibility could be tested for any given level 
as in a simple two-factor design, using only the data for that level or (on 
the assumption of homogeneous error variance) using the error term (msw) 
obtained from the table as a whole. There would be little point in making 
such tests, however, unless the ABL interaction had first been proved sig- 
nificant for the table as a whole. 

If one can assume that the ABL interaction is significant because of experi- 
mental errors only (no intrinsic ABL interaction), and if experimental errors 
have been independently randomized at each level (this would involve inde- 
pendent administration of the treatments at each level separately), then the 
ABL effect is a random effect. In this case, the interaction may be tested by


$$F = ms_{AB}^{*}/ms_{ABL}^{*},$$


on the conditions listed on page 232. If $n_{ij}$ is constant, the interaction may
be tested by $F = ms_{AB}/ms_{ABL}$.

If the observed ABL interaction is non-significant, and if presumably
there is no ABL interaction in the population, the AB interaction may be
tested as in (86) and on the same conditions.

Testing the AL and BL Interactions: If A X B X L is significant and presum-
ably partly intrinsic, neither $ms_{AL}/ms_{ABL}$ nor $ms_{BL}/ms_{ABL}$ is distributed as
F, since neither A nor B is a random effect. However, $ms_{AL}/ms_w$ is distributed
as F under the following conditions:




1) The subjects receiving each A-treatment at each level of L were orig-
inally a representative sample from the same population (representative
with respect to the B-categories). After administration of the 
treatments, the subjects in each A-L combination are a representative 
sample (with reference to B) from a corresponding hypothetical 
population. 


2) The distribution of criterion measures is normal for the subpopulation 
corresponding to each cell in the three-way table. 


3) Each of these distributions has the same variance. 


4) The AL interaction in the population is zero (that is, the relative 
effects of the A treatments are the same from one L subpopulation to 
another). 

If ABL is significant and presumably partly intrinsic, the "population"
referred to in Conditions 1 and 4 is a population in which some members take
Treatment $B_1$ in combination with each of the A-L combinations, some take
Treatment $B_2$ in combination with the A-L combinations, etc., the numbers
taking the various B treatments being in the same proportion as in the
experimental sample. This, obviously, is a "population" of little interest. In
other words, if the AL interaction depends on which B category is involved,
the interest would be in the AL interaction for each B category separately,
rather than in the average of the AL interactions for the various B
categories.

The test of significance for the BL interaction is similar to that for AL. 

Testing Main Effects: If AB is significant and presumably partly intrinsic, 
then neither $ms_A/ms_{AB}$ nor $ms_B/ms_{AB}$ is distributed as F, since neither A
nor B is a random effect. In this case, these mean square ratios may be eval- 
uated as in a two-factor design (see pages 214-215), but the safer pro- 
cedure is to test the main effect of the A factor for each category of B sep- 
arately, and the main effect of the B factor for each category of A separately, 
with no attempt to generalize about A differences for all B categories or about 
B differences for all A categories. For any given category of B considered 
alone, the design is an A X L design of the type considered in Chapter 5, and 
the main effect of either A or L may be tested in the manner explained on 
pages 214-215. 

If all the two-factor interactions prove non-significant, and if the assump-
tion is tenable on a priori grounds that there is no intrinsic triple interaction,
but that the observed triple interaction is due entirely to randomized experi-
mental errors plus random sampling fluctuations, the main effects of A and B
can be tested by

$$F = ms_A^{*}/ms_{ABL}^{*} \quad \text{or} \quad F = ms_B^{*}/ms_{ABL}^{*}.$$


The conditions under which these mean square ratios are distributed as F and
the proofs that they are so distributed are left as an exercise for the
student.

If all observed interactions are non-significant when tested against $ms_w$,
and one can assume that there is no interaction (either intrinsic or extrinsic)
of any kind in the population, then one can use $ms_w$ as the error term in test-
ing main effects, or can pool the sums of squares for all interactions with that
for within-cells, and divide by the sum of their degrees of freedom to provide 
an error term. Proof of this is also left as an exercise for the student. 


Three-Factor (A x B x C) Designs 


We have noted repeatedly that there are no differences in the analysis 
and interpretation of a treatments X levels and a treatments X treatments 
(factorial) design, except that the "populations" concerning which certain 
inferences are drawn are more meaningful in the case of the treatments X 
levels than in the case of the treatments X treatments design. We are there- 
fore generally interested in main effects in an A X B design only if the ob- 
served AB interaction is non-significant and if presumably there is no inter- 
action in the population. 

Exactly the same situation prevails with reference to the A X B X C and 
A X B X L designs. All the tests of significance that can be used in the
A X B X L design can also be used in the A X B X C design, C taking the
place of L. In the A X B X C design, however, we would generally be inter- 
ested in any two-factor interaction in the table as a whole only if the observed 
triple interaction were non-significant and if presumably there was no triple 
interaction in the population. If the ABC interaction is significant and pre- 
sumably partly intrinsic, our interest would be in the interaction of two 
of the factors for each category of the third factor separately. 


Testing Differences in Individual Pairs of Means in Three-Dimensional Designs 


In general, in complete three-dimensional designs, if $F = ms_A/ms_{error}$ is
valid to test the effect of the A factor, whether for the table as a whole or
for individual levels or categories or combinations of other factors, then the
estimated error variance of the difference between any two corresponding A
means, based on $n_1$ and $n_2$ cases respectively, is given by


$$\hat{\sigma}^2_{(\bar{A}_1 - \bar{A}_2)} = ms_{error}\left(\frac{1}{n_1} + \frac{1}{n_2}\right).$$


This is true regardless of what $ms_{error}$ may represent. In some instances
$ms_{error}$ may be $ms_{AS}$; in others it may be $ms_{AR}$, in others $ms_w$, in others
$(ss_{ABC} + ss_w)/(df_{ABC} + df_w)$, etc. The difference is then tested by the
ordinary t-test with a number of degrees of freedom equal to that for $ms_{error}$.




Similar statements, of course, apply to other factors or classifications (in- 
cluding levels, subjects, and replications). 
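
The test of a single pair of means described above can be sketched as follows. The function name and the numbers are hypothetical. Which mean square serves as `ms_error`, and therefore how many degrees of freedom the t has, depends on the design, as explained in the text.

```python
import math


def t_for_difference(mean_1, mean_2, n_1, n_2, ms_error):
    """t for the difference between two means, using a given error mean square."""
    # Estimated error variance of the difference between the two means.
    var_diff = ms_error * (1.0 / n_1 + 1.0 / n_2)
    return (mean_1 - mean_2) / math.sqrt(var_diff)


# Hypothetical values: two A means, each based on 4 cases, with ms_error = 2.0.
t_value = t_for_difference(5.0, 3.0, 4, 4, 2.0)
```

The resulting t would be referred to the t table with the degrees of freedom belonging to whichever error mean square was supplied.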


STUDY EXERCISES¹


1. An experiment² was performed to compare the accuracy of visual
interpolation in circular coordinate plots differing in size of scale interval
(I), and in the amount of illumination of the viewing field, field intensity 
(F). 

Each coordinate plot consisted of a number of concentric circles with equal 
intervals between circles, printed in black on a sheet of very thin, translucent 
paper. This was placed over a sheet of blueprint paper containing 16 small 
(2 mm.) white dots more or less randomly distributed. This combination was 
then fastened to the frosted-glass surface of an inclined projection table 
which was illuminated from behind. The entire display was “masked down” 
to produce a circular opening 24 inches in 
diameter, as shown in the following diagram. 

Four separate coordinate plots were provided, one each with 1-inch, 2-inch,
3-inch, and 4-inch scale intervals (I), i.e., linear distance between successive
concentric circles. (The scale lines themselves were of negligible width in
comparison with the interval.) The projection surface was illuminated from
behind with either a 100-watt bulb producing a dark blue background, or with
a 300-watt bulb which produced a light blue background. In the former case,
the scale lines were somewhat less distinct, but the white dots more
conspicuous than in the latter. These two field intensity (F) conditions will be
identified as $F_1$ and $F_2$ respectively. (Room lighting was maintained at a
constant level during every experimental session.)

The subjects, six experienced aircraft radar navigators, were instructed 
to estimate the distance, along a radius, from the center of the coordinate 
plot to the center of each of the white dots, starting with the one closest to
the center and working counterclockwise around each circle in turn. The
subjects were told that the scale interval represented 1000 yards and that
they were to make their estimates in those units. The criterion of visual
interpolation was obtained by expressing the absolute error in the estimate
of distance to a given dot as a percentage of the scale interval and taking the
average of such values for the 16 estimates (dots) made by each subject
under each of the eight treatment-combinations. Note that although the
subjects were instructed to estimate the total distance to each dot, this
criterion is concerned only with the accuracy of interpolation within the
interval. Thus an absolute error of 20 yards, using any scale interval, repre-
sents an interpolation error of 2% (criterion value of 2.00), whether the total
distance from the center of the plot was, say, 2100 or 9600 yards.


1 See second paragraph on page viii.

2 The experiment described here is hypothetical, but was suggested by a study re-
ported by M. Leyzorek, “Accuracy of Visual Interpolation Between Circular Scale
Markers as a Function of the Separation Between Markers," Journal of Experimental
Psychology, vol. 39 (April, 1949), pp. 270-279. In his study, Leyzorek used several
additional scale intervals and a Latin Square design, but did not vary the field in-
tensity. However, the techniques used here and the results obtained are quite con-
sistent with those of Leyzorek.

Each of the six subjects reported for a one-hour experimental session each 
day for four successive days. Since all of the subjects had recently partici- 
pated in an experiment using somewhat similar materials, it was assumed 
that there would be no practice (or order) effect with respect to the scale 
interval factor. Hence, it was considered “safe” to administer the scale inter-
vals in the same order for all subjects. All used the 1-inch scale the first
day, the 2-inch scale the second day, etc. With each interval size, however, 
the order of administration of the field intensity conditions was independently 
determined for each subject by the flip of a coin. Furthermore, since the two 
intensity conditions were administered in the same one-hour session with 
identical scale intervals, it was necessary to use two “equivalent” dot pat- 
terns, A and B. Pattern A was always used for a subject’s first trial at each 
session, and Pattern B for the second. 

The following table presents the average interpolation error for each subject 
under each treatment-combination. Note that the smaller the criterion 
measure, the “better” the performance. 


             I1                I2                I3                I4
         1" Interval       2" Interval       3" Interval       4" Interval

          F1      F2        F1      F2        F1      F2        F1      F2
S1       3.48    4.20      4.87    4.03      3.60    3.85      2.75    3.56
S2       7.86    7.31      5.42    5.36      6.41    6.73      3.98    4.20
S3       5.08    4.64      7.34    6.82      5.23    5.90      6.26    7.54
S4       4.47    4.30      5.01    5.48      3.65    4.22      4.60    5.11
S5       4.23    3.00      2.32    2.29      2.70    2.76      2.17    2.60
S6       3.03    3.46      4.19    5.00      3.02    3.14      2.45    3.18

ΣX      28.15   26.91     29.15   28.98     24.61   26.60     22.21   26.19
M        4.69    4.49      4.86    4.83      4.10    4.43      3.70    4.37
ΣX²    146.75  132.07    155.00  151.76    111.13  130.21     94.46  130.15
T²/6   132.07  120.69    141.62  139.97    100.94  117.93     82.21  114.32


The following tables of totals are provided for computational convenience. 


Subjects x Field (n per cell = 4)

            F1       F2      Total
S1        14.70    15.64     30.34
S2        23.67    23.60     47.27
S3        23.91    24.90     48.81
S4        17.73    19.11     36.84
S5        11.42    10.65     22.07
S6        12.69    14.78     27.47

Total    104.12   108.68    212.80


Field x Interval (n per cell = 6)

            I1       I2       I3       I4      Total
F1        28.15    29.15    24.61    22.21    104.12
F2        26.91    28.98    26.60    26.19    108.68

Total     55.06    58.13    51.21    48.40    212.80


Summary Table


Source                     df        ss        ms

Intervals (I)
Field Intensity (F)
Subjects (S)                       73.62
IF
IS                                 25.50
FS
IFS

Total                             108.13
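
The entries already printed in the summary table can be checked with a short script. The cell values below are an assumption: they are a transcription chosen so that every column reproduces the printed ΣX and ΣX² rows and every row reproduces the printed subject totals. The variable names are illustrative only, and the remaining lines of the table are left to the reader as in part (a).

```python
# Rows are subjects S1..S6; columns are I1F1, I1F2, I2F1, I2F2, I3F1, I3F2, I4F1, I4F2.
scores = [
    [3.48, 4.20, 4.87, 4.03, 3.60, 3.85, 2.75, 3.56],
    [7.86, 7.31, 5.42, 5.36, 6.41, 6.73, 3.98, 4.20],
    [5.08, 4.64, 7.34, 6.82, 5.23, 5.90, 6.26, 7.54],
    [4.47, 4.30, 5.01, 5.48, 3.65, 4.22, 4.60, 5.11],
    [4.23, 3.00, 2.32, 2.29, 2.70, 2.76, 2.17, 2.60],
    [3.03, 3.46, 4.19, 5.00, 3.02, 3.14, 2.45, 3.18],
]

n_total = sum(len(row) for row in scores)          # 48 criterion measures
grand_total = sum(sum(row) for row in scores)
ct = grand_total ** 2 / n_total                    # correction term

column_totals = [sum(col) for col in zip(*scores)]  # the eight I-F cell totals
subject_totals = [sum(row) for row in scores]       # each based on 8 measures

ss_total = sum(v * v for row in scores for v in row) - ct
ss_subjects = sum(t * t for t in subject_totals) / 8 - ct

print(round(ss_subjects, 2), round(ss_total, 2))
```

The same pattern (squared totals divided by the number of measures in each total, less the correction term) extends to the other sources in the table.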


a) Complete the calculations and fill in the summary table. 


b) Explain in terms specific to this experiment what is meant by a triple 
interaction. By a triple interaction effect. What would it mean to say 
that this interaction is heterogeneous? What is probably the major 
source of triple interaction? 


c) May the hypothesis of no IF interaction be rejected at the 5% level?
Define as precisely as possible the population to which this hypothesis
applies. In this test, what is assumed to be normally distributed with
homogeneous variance?


d) Since "days" is confounded with "intervals" in this design, does it 
seem likely that the IF interaction is really due in part to interaction 
between Days and Field Intensity? Suggest some reasons for believing 
that the IF interaction is primarily intrinsic. 


e) Why does the outcome of the test in (c) direct attention to the simple 
effects of I and F, rather than to their main effects?


f) If the IF interaction had been non-significant, a test of the main effect 
of Field Intensity would have been indicated. For that test, what 
error term should be used? What assumptions are involved? 


g) Is the simple effect of Intervals for the dark field condition ($F_1$) sig-
nificant at the 5% level? What assumptions are involved in this test?


h) What mean square ratio may be used to test the simple effect of field 
intensity for the 1-inch interval? 


i) Under what assumptions would it be legitimate to pool the FS and IFS 
interactions? What test can be applied to determine the tenability 
of one of these assumptions? What would be gained by such a pooling 
in this instance? 


j) Suppose the assumption of no practice (or order) effect with reference 
to intervals could not be made. How then should the intervals be ad- 
ministered if the same method of analysis is to be used? How would 
this affect the apparent precision of the experiment? 


2. An experiment was carried out by Wilson¹ to investigate remote asso-
ciations within serial lists of adjectives as a function of (A) degree of learning,
(B) distribution of practice, and (C) delay of recall. 

The experimental design was the A X B X C design (page 243). The levels 
of these three factors were as follows: Degrees of learning — 50% (one-half 
of the words in the list correctly anticipated), 75%, 100%, and 200% (the 
entire list correctly anticipated in each of two successive trials); Distribu- 
tion of practice — 6, 30, and 60 seconds between trials during practice; 
Delay of recall — 0, 2, 5, and 20 minutes from close of learning period to 
beginning of recall test. The recall test consisted of the same 16 words as 
in the practice list, but with the words in a different random order. There 
were thus 4 X 3 X 4 = 48 different treatment combinations.

A different random order of the 16 words was assigned to each treatment 
combination as the list to be learned under that condition. Still a different 
random order of the words was later assigned to each treatment-combination 
as the list to be used during recall. According to Wilson, this procedure re-
duced to a minimum the differential effects which might arise due to inherent
associations between particular pairs of words.


1 John T. Wilson, “The Formation and Retention of Remote Associations in Rote
Learning,” Journal of Experimental Psychology, vol. 39 (December, 1949), pp. 830-838.

On the recall test, each word was exposed until a response was elicited. 
The subject was instructed to respond as quickly as possible with “the first 
word from the list which comes to mind." A “remote association” was de-
fined as a response word from within the list other than the word following
the cue word in the original list. The criterion measure for each subject
was the number of remote associations on the recall test.

The subjects were 144 college students, 40 women and 104 men, from a 
course in elementary psychology. Three subjects were assigned at random to
each of the 48 treatment-combinations. Each subject was given a practice 
session 24 hours preceding his experimental session, designed to familiarize 
him with learning and recall instructions and the general procedure of the 
experiment. During the experimental session the subjects under the 30- and 
60-second spacing conditions named colors during the intertrial interval — 
subjects under the 2-, 5-, and 20-minute delay conditions rated cartoons 
during the delay period. This was done to reduce extraneous recitation. 

The over-all criterion means for the treatment groups are as follows: 


              A                          B                       C
   Degrees of Learning (%)       Spacing Interval        Delay Interval (min.)
                                      (sec.)

    50     75    100    200      6      30     60       0      2      5     20

   7.47   6.44   3.47   3.17    4.60   5.89   4.92     5.50   5.17   5.11   4.77


Summary Table 


Source                      df        ss         ms
Degree of Learning (A)       3      497.39     165.80
Spacing Interval (B)         2       43.60      21.80
Delay Interval (C)           3        9.44       3.15
Interaction (AB)             6       40.24       6.71
Interaction (AC)             9      112.61      12.51
Interaction (BC)             6       39.18       6.53
Interaction (ABC)           18      161.43       8.97
Within (w)                  96      757.33       7.89
Total (tot)                143     1661.22
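
As a check on the summary table, the within-cells line can be recovered by subtraction and each mean square expressed as a ratio to it. In this sketch the total degrees of freedom are assumed to be 143 (144 subjects less one), and the within-cells sum of squares is taken as the total less all other sources, which gives 757.33 and agrees with the tabled mean square of 7.89 on 96 degrees of freedom. The variable names are illustrative only.

```python
# (df, ss) for each source other than within-cells, as printed in the table.
table = {
    "A": (3, 497.39),
    "B": (2, 43.60),
    "C": (3, 9.44),
    "AB": (6, 40.24),
    "AC": (9, 112.61),
    "BC": (6, 39.18),
    "ABC": (18, 161.43),
}
ss_total = 1661.22
df_total = 143  # assumption: N - 1 with N = 144 subjects

# Within-cells by subtraction from the total.
ss_w = ss_total - sum(ss for _, ss in table.values())
df_w = df_total - sum(df for df, _ in table.values())
ms_w = ss_w / df_w

# Ratio of each mean square to the within-cells mean square.
f_ratios = {name: (ss / df) / ms_w for name, (df, ss) in table.items()}

for name, ratio in f_ratios.items():
    print(f"{name:>3}: F = {ratio:.2f} on {table[name][0]} and {df_w} df")
```

Whether a given ratio is the appropriate test depends on the considerations raised in the exercises that follow; the script only carries out the arithmetic.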


a) In interpreting the results of an analysis of this kind in an A X B X C
design, the first step is usually to test the ABC interaction. Suppose
this interaction is found to be significant. In this case, what is likely
to be true (in the population) of the AB interactions for the various
levels of C? Could there be a strong AB interaction for one level of
C and none for another? In that case, would one be more interested
in the “simple” AB interactions (those for individual levels of C)
or in the “main” AB interaction?


b) If ABC were significant, and one nevertheless tested the “main” BC
interaction, to what specific population would the hypothesis tested
apply? Why would there usually be little interest in this population?


c) If the ABC interaction in this experiment had been found significant,
how would you proceed to test the simple AC interaction for level $B_2$
(30-second spacing) of B?


d) If ABC were significant, would you be most interested in the main effect
of A, the simple effect of A for a given level of B (or C), or the simple
effect of A for a given BC combination? Explain.


e) In this experiment, on how many cases is each of the B means for $A_1C_1$
based? Could one then make a very precise test of the simple effect
of B for $A_1C_1$?


f) If there had been a strong and significant ABC interaction in this
experiment, might the whole experiment have proved practically futile?
Explain.


g) Fortunately for the purposes of this experiment, the ABC interaction
was found to be non-significant, so that the main two-factor interactions
could be tested on the assumption that ABC is non-existent in the
population. What mean square ratio would you use to test the main
AC interaction? State the hypothesis tested by this ratio, specifying
the population involved.


h) Suppose the AC interaction had been found to be significant, so that
one would probably not be interested in the main effect of A or C. For
$C_1$ alone, on how many cases would each of the A means be based?
With AC significant, might the test of the simple effect of A at $C_1$ be
too imprecise to be of much value?


i) Were AC significant, on what condition might one still be interested
in the main effect of B?


j) Is any interaction in this experiment significant at the 5% level? Is
this a desirable outcome if one is primarily interested in the main
effects of A, B, and C? Explain.




k) If you had planned this experiment just as described, and had then 
found reason to believe that all interactions would probably be found 
significant, would you still go ahead with the experiment? Does the 
answer to this question depend on what are your major interests in the 
experiment? Explain. 


l) Define one of the treatment populations involved in the test of the
main A effect.


m) What explanation can you offer for the fact that several of the mean 
square ratios are less than that for within-cells? 


n) On the assumption that all interactions are non-existent in the popula- 
tion, one could pool the sums of squares with that for within-cells. 
Compute the pooled error mean square. How many degrees of freedom 
does it have? Just what is gained by this pooling in this case? Is this 
gain of much importance? Under what conditions is such pooling most 
worth while? What risks does it always entail? 


o) Do you see any point in using a different order of the words in the 
learning list for each treatment-combination? Compare Wilson's pro- 
cedure of randomizing lists with the simplest satisfactory alternative 
you can suggest. 


p) Compute the estimated standard deviation of a cell population. Con- 
sidering the sizes of the treatment means, do you see any reason to 
doubt that these populations are normal? Does this in itself cause 
you to question the validity of the tests of significance used? 
Explain. 


q) Can you see any reason for using different error terms for testing the
difference in the $A_3$ and $A_4$ means and the difference in the $A_1$ and
$A_2$ means? Explain.


3. In an experiment in retroactive inhibition carried out by Haverkamp,¹
each subject: (1) learned by the anticipation method (in 8 trials) a list of
10 stimulus-response pairs of adjectives presented by memory drum, (2)
learned another (interpolated) list, with the same stimulus words, but with
different response words, and (3) relearned the original list (after an initial
warm-up trial). One of the purposes of the study was to determine the effects
upon retention of the original list of (1) the degree of synonymity between
the response words of the original and interpolated lists and (2) the degree
of learning (number of trials) of the interpolated list. Three different degrees
of synonymity and three degrees of learning were represented, thus
constituting nine different treatment-combinations. The criterion measure
for each subject was the difference between the number of correct responses
(out of 10) on his first trial of relearning and that on his last trial of original
learning. Thus the subject with the “best” performance was the one with
the lowest criterion measure.


¹ H. J. Haverkamp, “Retroactive Inhibition as a Function of Response Synonymity
and Interpolated Learning,” Ph.D. Thesis, State University of Iowa, August, 1951.

Two hundred seven volunteer subjects from the course in sophomore 
psychology in two undergraduate colleges were employed. Each subject 
reported for one hour on each of three successive days. The first two days 
were devoted to practice sessions, the last to the experimental session. The 
"running" of the subjects was done at two different places over a consider- 
able period of time. The A X B X L design was used, the nine treatment 
(AB) groups being matched on the basis of a control measure, the number 
of correct responses on the last original-learning trial. 

Subjects with the same control measure were assigned to the same “level.” 
Since the range of this measure was 0-10, there were thus 11 possible levels, 
or a possible 99 cells in the 3 X 3 X 11 table. Since the control measure for 
each subject was obtained in the same session as the criterion measure, it 
was not possible to employ the usual “counting off” method in constituting 
the levels. Instead, each of the early subjects was assigned at random to 
one of the treatment-combinations for his own level, the assignment being 
made immediately after the last original-learning trial, the trial on which the 
control measure was obtained. This was continued with subsequent subjects 
until one of the cells in the subject's level was filled with its “quota” of 3 
subjects, in which case the subject was assigned at random to one of the 
unfilled cells in his level. When all cells in a level had been filled, the data 
for subsequent subjects in that level were discarded. After all 207 subjects 
had been run, all data in any level for which all nine cells were not filled 
were discarded. This resulted in the discard of all subjects whose control
measures were 0, 1, 8, 9, or 10, as well as of the surplus subjects in the
remaining levels. Thus the data for only 162 (out of 207) subjects in six levels
were used in the analysis.¹
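
The assignment rule just described can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `assign`, the random seed, and the simulated control measures (drawn uniformly from 0 to 10) are hypothetical and are not Haverkamp's data. Choosing at random among the unfilled cells of a level is equivalent to the rule in the text, since before any cell is filled every cell is still open.

```python
import random

def assign(level_counts, quota, rng):
    """Assign one subject to a cell of his level, or return None to discard.

    level_counts maps each treatment-combination to the number of subjects
    already placed in that cell for this subject's level.
    """
    open_cells = [cell for cell, k in level_counts.items() if k < quota]
    if not open_cells:              # every cell is at quota: discard
        return None
    cell = rng.choice(open_cells)   # random choice among unfilled cells
    level_counts[cell] += 1
    return cell

# Hypothetical run: 3 x 3 treatment-combinations, quota of 3 per cell,
# 11 possible levels (control measures 0 through 10), 207 subjects.
rng = random.Random(0)
cells = [(s, i) for s in range(3) for i in range(3)]
counts = {level: {c: 0 for c in cells} for level in range(11)}
for _ in range(207):
    level = rng.randint(0, 10)      # illustrative control measure only
    assign(counts[level], 3, rng)

# Only levels in which all nine cells reached the quota are retained.
complete = [lv for lv, c in counts.items() if all(k == 3 for k in c.values())]
retained = 27 * len(complete)       # 27 subjects per complete level
```

With six complete levels, as in the experiment itself, the number retained would be 6 x 27 = 162.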

The mean of the criterion measures (over all levels) for each treatment 
and treatment-combination group is presented in the following table. 


¹ This method is described in detail here because it has possibilities in other similar
situations. However, Haverkamp discarded more data than was necessary. He might 
have combined unit intervals with small frequencies; that is, he might have defined his 
top interval as 8-10 and his bottom interval as 0-1, making a total of 8 levels. He 
might then have set an initial minimum quota of two per combination, rather than 
three. As soon as a level was filled at this quota, he could raise the quota for that level 
to 3 per cell, when filled at this quota to 4 per cell, etc. Had he used this method, 
Haverkamp would probably have discarded less than 12 cases out of 207, and could 
probably have retained all 8 levels in the analysis. 




                                   Degree of Synonymity
                          $S_1$ (High)   $S_2$ (Med)   $S_3$ (Low)

Degree of Interpolated
Learning
                             1.81          2.50         2.93

N (per cell) = 18


The summary of the analysis is as follows: 


Source
Syn (S)
Interpolated Learning (I)
Levels (L)
SI
IL


a) Define carefully the population about which one may draw strict
statistical inferences about the SI interaction from this experiment.


b) Test the significance of the SI interaction. What error term did you
employ? On what assumption could you have used $ms_{SIL}$? $ms_{SL}$?
On what assumption could you have pooled the sums of squares for
“within,” SIL, and SL to provide an error term for this test? Is there
sufficient advantage in using a pooled error term to justify making these
relatively unsupported assumptions in this case? Explain.


c) The criterion measure in this experiment is the difference between the 
control measure (number of correct responses on the last original-learning 
trial) and the number of correct responses on the relearning trial. In 
consideration of this fact, explain why the variance of the criterion 
measures in the lowest (2) level must necessarily be very much smaller 
than that in the highest (7) level. (Haverkamp does not report the 
original individual measures, so that this inference cannot be verified 
from his report.) 


d) Test the hypothesis that the SL interaction in the population is zero,
limiting the risk of a Type I error to 5%. What “apparent” level of
significance must you employ to feel reasonably certain that the risk
of a Type I error is less than 5%? Explain fully.


e) Test the significance of the IL interaction, limiting the risk of a Type I
error to 5%.


f) Define a real population to which the experimenter would presumably
like to generalize from this experiment. In what important respects
does this population differ from that described in (a)? Does the
outcome of the tests of (d) and (e) facilitate such generalization? Explain.


g) Describe in general terms the relative advantages and disadvantages 
of the usual “counting off” method of constituting levels and the method 
employed in this experiment (with improvements suggested in the 
footnote). 


h) Show that this experiment, based on 162 subjects, is more precise than
a simple factorial experiment with all 207 subjects would have been.
That is, estimate the error variance of a single mean (such as M) that
would have been obtained in a simple factorial experiment, and express
the corresponding error variance in the matched group experiment as a
percent of this estimated error variance.


i) If you wished to test the difference between the $S_1I_1$ and $S_2I_1$ means,
what data would you need that have not here been supplied? Explain.


Higher-Dimensional Designs 


Analysis in Higher-Dimensional Designs 


We are now ready to generalize to higher-dimensional designs the principles 
that have been established for one-, two-, and three-dimensional designs. 

We have seen that a simple-randomized (one-dimensional) design may be
replicated into a two-dimensional design (A X L, A X S, or A X R), and that
a two-dimensional design (A X B or A X L) may be replicated into a
three-dimensional design (A X B X R, A X B X L, A X B X S, A X B X C, or
A X L X R). In the same fashion, a three-dimensional design may be
replicated into a four-dimensional design, in which the replications may be random
replications, or levels of a control variable, or categories of a fourth factor, etc.
This may be continued without theoretical limit; a design in any dimension
may be replicated at will to constitute a design of a still higher order.

The total sum of squares in a simple-randomized (one-dimensional) design
may be analyzed into two components: the (main) treatment effect and
within-treatments.

The total sum of squares in a two-dimensional design may be analyzed into 
four components: two main effects, one two-factor interaction, and within- 
treatment-combinations. 

The total sum of squares in a three-dimensional design may be analyzed into 
eight components: three main effects, three two-factor interactions, one three- 
factor interaction, and within-treatment-combinations. 

The total sum of squares in an n-dimensional design may be analyzed into $2^n$
components: n main effects, n(n − 1)/2 two-factor interactions,
n(n − 1)(n − 2)/6 three-factor interactions, etc., and
within-treatment-combinations. [The successive numbers of main effects and
interactions are the terms in the expansion of $(1 + 1)^n$, omitting the first
term, unity.]
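
The count of components can be illustrated with binomial coefficients. The sketch below is a minimal illustration; the function name `component_counts` is ours, not the text's. The number of k-factor effects in an n-factor design is the number of ways of choosing k factors from n, and adding the within-combinations component to these gives the full $2^n$.

```python
from math import comb

def component_counts(n):
    """Number of k-factor effects (k = 1 is a main effect) in an n-factor design."""
    return {k: comb(n, k) for k in range(1, n + 1)}

for n in (2, 3, 4):
    counts = component_counts(n)
    # Adding the within-combinations component gives 2**n in all.
    print(n, counts, sum(counts.values()) + 1)
```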


Computational Procedures 


The sum of squares for a two-factor interaction (regardless of the number of
other factors involved in the design) is always computed (see pages 225-226)
by first computing the sum of squares for between-combinations of those
two factors and subtracting the sums of squares for their main effects. For
example, in a five-factor (ABCDE) design,

$$ss_{BE} = ss_{\overline{BE}} - ss_B - ss_E,$$

in which $ss_{\overline{BE}}$ is the sum of squares for between-combinations of B and E. The
sum of squares for between-combinations is computed by

$$ss_{\overline{BE}} = \sum \left( T_{B_iE_j}^2 / n_{B_iE_j} \right) - T^2/N,$$

in which $T_{B_iE_j}$ is the sum of the criterion measures for a single combination of
B and E, $n_{B_iE_j}$ is the corresponding number of subjects, and the summation is
for all possible combinations. For example, $T_{B_1E_1}$ would be the sum of the
criterion measures for all subjects who had received the $B_1E_1$ combination,
regardless of the other factors with which this combination may have been
combined.

Similarly,

$$\begin{aligned}
ss_{ACD} &= ss_{\overline{ACD}} - ss_A - ss_C - ss_D - ss_{AC} - ss_{AD} - ss_{CD} \\
&= ss_{\overline{ACD}} - (\text{sums of squares for main effects of A, C, and D}) \\
&\quad - (\text{sums of squares for all two-factor interactions of A, C, and D}),
\end{aligned}$$

in which $ss_{\overline{ACD}}$ is the sum of squares for between-combinations of A, C, and D,
without regard to the other factors.

Similarly,

$$\begin{aligned}
ss_{ABDE} &= ss_{\overline{ABDE}} - (\text{sums of squares for main effects of A, B, D, and E}) \\
&\quad - (\text{sums of squares for all two-factor and three-factor interactions involving A, B, D, and E only}) \\
&= ss_{\overline{ABDE}} - (\text{sums of squares for all main effects and all lower-order interactions involving only these factors}).
\end{aligned}$$

This last expression suggests a generalized formula for computing the sum of
squares for any higher-order interaction. Stated in words, the sum of squares
for any higher-order interaction involving certain specified factors is the sum
of squares for between-combinations of these factors minus the sums of squares
for all main effects and lower-order interactions involving the specified factors
only.
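
The generalized rule lends itself to a short recursive sketch. What follows is a minimal illustration, assuming equal numbers of subjects in every treatment-combination; the function names `ss_between` and `ss_effect` and the four-observation data set are hypothetical and serve only to show the subtraction at work.

```python
from itertools import combinations

def ss_between(data, factors):
    """Between-combinations sum of squares for the given factors.

    data is a list of (levels, y) pairs, where levels maps factor name to level.
    Computes sum(T_cell**2 / n_cell) - T**2 / N.
    """
    totals, counts = {}, {}
    for levels, y in data:
        key = tuple(levels[f] for f in factors)
        totals[key] = totals.get(key, 0.0) + y
        counts[key] = counts.get(key, 0) + 1
    grand = sum(y for _, y in data)
    return sum(t * t / counts[k] for k, t in totals.items()) - grand ** 2 / len(data)

def ss_effect(data, factors):
    """Main effect or interaction: between-combinations minus all lower-order terms."""
    lower = sum(
        ss_effect(data, subset)
        for r in range(1, len(factors))
        for subset in combinations(factors, r)
    )
    return ss_between(data, factors) - lower

# Hypothetical 2 x 2 data, one observation per combination.
data = [({"A": 1, "B": 1}, 1.0), ({"A": 1, "B": 2}, 3.0),
        ({"A": 2, "B": 1}, 5.0), ({"A": 2, "B": 2}, 11.0)]
print(ss_effect(data, ("A",)), ss_effect(data, ("B",)), ss_effect(data, ("A", "B")))
```

For a main effect there are no lower-order terms, so the function reduces to the between-combinations sum of squares for that single factor.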


Interpretation of Higher-Order Interactions 


Any higher-order interaction (in the population) may be interpreted in a 
manner suggested by the interpretations of two- and three-factor interactions. 
We have seen that a three-factor interaction may be regarded as the interac- 
tion of one of the factors with the interaction of the other two. For instance, 
the ABC interaction may be regarded as the interaction of the C factor with 
the AB interaction. This is to say that the AB interaction effects depend on
which level of C is involved, or that corresponding AB interaction effects are
not the same from level to level of C.

A four-factor interaction may similarly be regarded as the interaction of one 
of the factors with the triple interaction of the other three. For instance, the 
ABCD interaction may be regarded as the interaction of the D factor with the 
ABC interaction. This is to say that corresponding ABC interaction effects 
are not the same from level to level of D. Similarly, ABCD may be regarded 
as (ABD)C or A(BCD), etc. A four-factor interaction may also be regarded as 
the interaction between two two-factor interactions, for instance, ABCD =
(AB)(CD) = (AC)(BD), etc. That is, the AB interaction depends on which 
combination of factors C and D is involved, or the BD interaction depends on 
which combination of factors A and C is involved, etc. 


A Notation for Factorial Designs 


For convenience in reference, it is common practice to identify factorial or 
higher-dimensional designs in terms of the product of the numbers of categories 
involved in the classifications. For example, a “3 X 5” design is one which
involves three treatments or categories in the first classification and five in the
second. A 2 X 2 X 3 X 3 design is one involving two treatments in each of the
first two classifications and three treatments or categories in each of the others,
etc. A $2^3$ design is a 2 X 2 X 2 design.


Practical Limitations of Higher-Dimensional Designs 


A “complete” factorial design is one in which all possible treatment com- 
binations are administered. Complete experimental designs of four or more 
dimensions are rarely employed in educational and psychological research. 
With so many factors or classifications, the experiment becomes so unwieldy 
that administrative difficulties and other practical considerations often out- 
weigh whatever theoretical advantage the designs may have over simpler 
lower-dimensional designs. Even with only three or four dimensions, the or- 
ganization and administration of the experiment often present serious practical 
difficulties. This is especially true if there are several treatments in each 
treatment classification, and if the higher order interactions are significant. 
In a 4 X 3 X 2 X 3 design, for example, there would be 72 different treatment- 
combinations. Obviously, unless some of the treatments could be adminis- 
tered to the same subjects, such an experiment would require at least 72 sub- 
jects — preferably at least 144 — in order to provide a within-combinations 
error term. The number of subjects required for higher-dimensional designs, 
therefore, often prohibits their use. Furthermore, unless many or all of the
treatments may be simultaneously administered on a group basis, the
administrative problems involved may also prohibit the use of such complex designs.
Finally, if the higher-order interactions prove significant, no very useful
interpretations may be made of the main effects or of the lower-order interactions;
the analysis must be broken down into a complex pattern of simple effects.
For example, in a four-factor (A X B X C X D) design the “simple” effects of
A include the average effect of A


— with all combinations of C and D at a given level of B 
— with all combinations of B and D at a given level of C 
— with all combinations of B and C at a given level of D 
— with a given CD combination at all levels of B 
— with a given BD combination at all levels of C 
— with a given BC combination at all levels of D 


and the effect of A for a given BCD combination. For this reason, higher- 
dimensional designs in general are of dubious advantage unless one may be 
fairly certain in advance that the higher-order interactions will prove non- 
significant. 
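
The counting in this section can be sketched as follows. This is an illustration only; the helper name `simple_effect_conditions` is ours. The number of treatment-combinations is the product of the numbers of levels, and the kinds of simple effect of A correspond to the non-empty sets of other factors that are held at a given level.

```python
from itertools import combinations
from math import prod

levels = {"A": 4, "B": 3, "C": 2, "D": 3}   # the 4 X 3 X 2 X 3 design of the text
n_combinations = prod(levels.values())      # number of treatment-combinations
min_subjects = 2 * n_combinations           # two per cell allows a within-cells term

def simple_effect_conditions(factor, factors):
    """List the sets of other factors held at a given level for a simple effect."""
    others = [f for f in factors if f != factor]
    return [fixed for r in range(1, len(others) + 1)
            for fixed in combinations(others, r)]

conditions = simple_effect_conditions("A", list(levels))
print(n_combinations, min_subjects)
print(conditions)
```

For factor A in a four-factor design this gives seven kinds of simple effect: three with one other factor fixed, three with two fixed, and one for a given BCD combination.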


"Complete" and "Incomplete" Factorial Designs 


For the reasons just given, complete factorial designs involving four or more 
factors are rarely employed in educational and psychological research. How- 
ever, considerable use is made of what may be described as "incomplete" 
factorial designs, in which only a portion of the possible combinations are 
administered. Under certain conditions, almost as much of the desired in- 
formation can be secured from a properly planned incomplete design as from a 
complete design involving the same factors. The possibilities of incomplete 
factorial designs will be considered in the succeeding chapters (12 and 13). 


Latin Square and Graeco-Latin Square Designs


Introduction 


A serious disadvantage of complete factorial designs involving several
factors with several levels of each factor is that the number of
treatment-combinations or the number of different groups to which
treatment-combinations must be administered may become so large as to make
the experiment administratively unmanageable or impracticable.

In $a^3$ experiments of this character (that is, a X a X a experiments in which
the number of treatments, a, is the same for each factor), the procedure is
sometimes followed of including in the experiment only $a^2$ of the $a^3$ possible
treatment-combinations, giving each treatment-combination to an independent
random sample of the same size. The treatment-combinations are so
selected that each treatment in one classification is combined once and only once 
with each treatment in each of the other classifications. The comparison of 
overall treatment means for any one classification would then appear to be 
completely balanced so far as the effects of superimposed treatments from 
other classifications are concerned. 

For example, in an experiment involving three levels of each of three factors 
(A, B, and C), the 27 possible combinations of treatments (all of which would 
be included in a complete factorial design) are 


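$A_1B_1C_1$*     $A_2B_1C_1$      $A_3B_1C_1$
$A_1B_1C_2$      $A_2B_1C_2$      $A_3B_1C_2$*
$A_1B_1C_3$      $A_2B_1C_3$*     $A_3B_1C_3$
$A_1B_2C_1$      $A_2B_2C_1$*     $A_3B_2C_1$
$A_1B_2C_2$*     $A_2B_2C_2$      $A_3B_2C_2$
$A_1B_2C_3$      $A_2B_2C_3$      $A_3B_2C_3$*
$A_1B_3C_1$      $A_2B_3C_1$      $A_3B_3C_1$*
$A_1B_3C_2$      $A_2B_3C_2$*     $A_3B_3C_2$
$A_1B_3C_3$*     $A_2B_3C_3$      $A_3B_3C_3$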




Suppose, however, that only the starred (*) combinations are administered
in the experiment. The results could then be entered in a table like the
following,

           $B_1$    $B_2$    $B_3$
$A_1$      $C_1$    $C_2$    $C_3$
$A_2$      $C_3$    $C_1$    $C_2$
$A_3$      $C_2$    $C_3$    $C_1$

in which the combination $A_1B_1C_1$ corresponds to the left cell in the first row,
$A_2B_2C_1$ to the middle cell in the second row, etc. In the comparison of A (row)
means, for instance, since each of the three B treatments and each of the three
C treatments appears once in each row, the A comparisons would seem to be
balanced so far as B or C effects are concerned, and similarly for the B
comparisons and C comparisons.
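
The selection of combinations can be sketched in code. The cyclic rule used below is an assumption of this sketch, chosen because it reproduces the nine starred combinations; any other Latin square would serve equally well. The check confirms that every level of each classification is combined once and only once with every level of each other classification.

```python
from itertools import product

a = 3
# Cyclic rule (an assumption that reproduces the starred combinations):
# with levels numbered 1..a, A_i B_j is paired with C_k, k = ((j - i) mod a) + 1.
selected = [(i, j, (j - i) % a + 1)
            for i, j in product(range(1, a + 1), repeat=2)]

def balanced(cells, x, y):
    """True if every pair of levels of classifications x and y occurs exactly once."""
    pairs = [(c[x], c[y]) for c in cells]
    return len(set(pairs)) == len(pairs) == a * a

print(selected)
print(balanced(selected, 0, 1), balanced(selected, 0, 2), balanced(selected, 1, 2))
```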

This type of design is known as a simple Latin square design. It derives its
name from an ancient puzzle, that of determining in how many different ways
Latin letters may be arranged in a square table so that each letter appears once
and only once in each row and each column. The following are examples of
Latin squares.


It may be shown, for instance, that there are two different ways of arranging
the letters in a 2 X 2 square, 12 different ways in a 3 X 3 square, 576 in a 4 X 4,
and 161,280 in a 5 X 5, etc.
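
For small squares these counts can be confirmed by direct enumeration. The sketch below is a brute-force illustration (the function name `count_latin_squares` is ours): it builds a square one row at a time, admitting a row only if it repeats no symbol already present in any column. It is practical only for small a.

```python
from itertools import permutations

def count_latin_squares(a):
    """Count a x a Latin squares by adding one row (a permutation) at a time."""
    rows = list(permutations(range(a)))

    def extend(square):
        if len(square) == a:
            return 1
        total = 0
        for row in rows:
            # A new row is admissible if it repeats no symbol in any column.
            if all(row[c] != prev[c] for prev in square for c in range(a)):
                total += extend(square + [row])
        return total

    return extend([])

for a in (2, 3, 4):
    print(a, count_latin_squares(a))
```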

The “simple” Latin square experimental design is one in which a different 
and independent random sample of n subjects corresponds to each cell of the 
table. As an illustration of a simple Latin square design, let us suppose that 
we wish in a single experiment to compare the effects on reading rate of three 
styles of type (A), three sizes of type (B), and three widths of column (C). In a
complete factorial design of the type suggested on page 243, this would require 
the printing of 27 different editions of the rate-of-reading test, each with a 
different combination of size, style, and width of columns. Suppose, however, 
that we printed only nine editions, combining each style of type only once with 
each size and only once with each width of column, and also combining each 
size only once with each width of column. We might then administer the nine 
editions simultaneously, each to one of nine different randomly selected groups 
of subjects. The main effect of styles would then be independent of the main 
effects of size and width, and similarly for the main effect of size or of width. 




However, for reasons that will be made clear later, this is a defective design 
unless one can assume that there are no interactions of the factors involved. 


Analysis in Simple Latin Square Designs 


In a simple Latin square design with a levels of each factor and n subjects 
receiving each treatment-combination (n subjects in each cell), the analysis of 
the total sum of squares is 


$$ss_T = ss_A + ss_B + ss_C + ss_w + ss_{res}. \qquad (90)$$

The corresponding analysis of the degrees of freedom is

$$(N - 1) = (a - 1) + (a - 1) + (a - 1) + (N - a^2) + (a - 1)(a - 2).$$
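
That the components of the degrees of freedom add to N − 1 may be verified by a short piece of algebra, given here as a worked check. The three main effects and the residual together account for the $a^2 - 1$ degrees of freedom between cells:

```latex
\begin{aligned}
3(a-1) + (a-1)(a-2) &= (a-1)\,[\,3 + (a-2)\,] = (a-1)(a+1) = a^2 - 1, \\
(a^2 - 1) + (N - a^2) &= N - 1 .
\end{aligned}
```

With a = 3, for example, the residual has only (2)(1) = 2 degrees of freedom.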


The sums of squares for the main effects are computed as in any factorial
design; for example,

$$ss_C = \sum \left( T_{C_i}^2 / n_{C_i} \right) - T^2/N.$$


The sum of squares for within-cells is computed by subtracting the sum of 
squares for between-cells from the sum of squares for total, and the last term 
in (90) is obtained as a residual. The usual procedure in an experiment like 
the size X style X width experiment described above has been to test the main 
effects against $ms_w$ as an error term. For instance, the main effect of C has
been tested by $F = ms_C/ms_w$. In some applications, the residual sum of
squares has been used as the error term — necessarily so when there has been 
only one observation per cell. These tests, however, are valid only under 
special and rather unusual conditions to be explained later, and the tests are 
not generally recommended. 

In contrast with (90), the analysis of the total sum of squares in the
complete factorial design involving the same factors would be

$$ss_T = ss_A + ss_B + ss_C + ss_{AB} + ss_{AC} + ss_{BC} + ss_{ABC} + ss_w.$$

If a is large, say, larger than 10, the sums of squares for A, B, and C, and for
within-cells would have very nearly the same meaning in an incomplete as in a
complete design with the same number of cases per cell, and $ss_{res}$ in the Latin
square would be interpreted in the same way as the pooled sums of squares for
all the interactions in the complete design. This is true, however, only when a 
is large — larger than would usually be the case in any psychological applica- 
tion. It is apparent, then, that there is no possibility of identifying or testing 
individual interactions in the Latin Square. Also the use of the "residual" 
mean square as an error term for testing main effects would be open to exactly 
the same objections as the use of a pooled error term in a complete design (see 
page 236). 

When a is small, however, the problem of interpreting and evaluating main
effects is considerably more involved. The difficulties encountered in this
situation are fully considered in the following section.


Confounding in Latin Square Designs 


It may be noted that in the introductory section preceding, the statements 
were made that the comparisons of overall treatment means in any classifica- 
tion would “appear” or would “seem” to be completely balanced with refer- 
ence to the other factors. This phrasing was used advisedly. Actually the 
comparisons are not truly balanced. 

Consider the following 2 X 2 Latin square.

           $A_1$    $A_2$
$B_1$      $C_1$    $C_2$
$B_2$      $C_2$    $C_1$

In this Latin square, the mean for column 1 is $\frac{1}{2}(M_{A_1B_1C_1} + M_{A_1B_2C_2})$ and
for column 2 is $\frac{1}{2}(M_{A_2B_1C_2} + M_{A_2B_2C_1})$. Hence the sum of squares for
“columns,” on which the estimate of the main effect of A will be based, depends on
the difference between the sums of the means of the following pairs of cells:

Column 1          Column 2
$A_1B_1C_1$       $A_2B_1C_2$
$A_1B_2C_2$       $A_2B_2C_1$

Obviously, the B and C effects are each counterbalanced and hence have no
effect on this difference. It is clear that any interaction of A and C or of A and
B would also be equalized or counterbalanced. If the AC interaction effects
(deviations of twice-corrected cell means from the general mean) were, for
example,

           $C_1$    $C_2$
$A_1$       +3       −3
$A_2$       −3       +3

these interaction effects would cancel and would not affect the column
difference, as indicated below.

Column 1               Column 2
+3  ($A_1C_1B_1$)      +3  ($A_2C_2B_1$)
−3  ($A_1C_2B_2$)      −3  ($A_2C_1B_2$)
3 − 3 = 0              3 − 3 = 0

Also, if the AB interaction effects were, for example,

           $B_1$    $B_2$
$A_1$       −5       +5
$A_2$       +5       −5

these effects would again cancel, as shown below:

Column 1               Column 2
−5  ($A_1B_1C_1$)      +5  ($A_2B_1C_2$)
+5  ($A_1B_2C_2$)      −5  ($A_2B_2C_1$)
−5 + 5 = 0             5 − 5 = 0


However, if the BC interaction effects were, for example,

           $C_1$    $C_2$
$B_1$       +2       −2
$B_2$       −2       +2

these effects would be completely confounded (inextricably intermingled) with
the A effect in the column difference. That is, the full effect of these
interactions would be included in the column difference, as follows:

Column 1               Column 2
+2  ($A_1B_1C_1$)      −2  ($A_2B_1C_2$)
+2  ($A_1B_2C_2$)      −2  ($A_2B_2C_1$)
2 + 2 = 4              −2 − 2 = −4


Thus, even though Treatments $A_1$ and $A_2$ were equally effective at each level
of B or C separately, the $A_1$ mean would be higher than the $A_2$ mean in this
comparison, due only to the fact that Treatment $B_1$ works better with $C_1$ than
$C_2$, and $B_2$ works better with $C_2$ than $C_1$.

The triple interaction effects are the same in both columns and do not affect
the differences between the columns. For instance, the triple interaction
effects may be

             $C_1$    $C_2$
$A_1B_1$      +2       −2
$A_1B_2$      −2       +2
$A_2B_1$      −2       +2
$A_2B_2$      +2       −2

In this case, the effect of the triple interaction effects on the column means
would be as follows:

Column 1               Column 2
+2  ($A_1B_1C_1$)      +2  ($A_2B_1C_2$)
+2  ($A_1B_2C_2$)      +2  ($A_2B_2C_1$)
2 + 2 = 4              2 + 2 = 4


In general, then, the BC interaction effect, but only this effect, is completely 
confounded with the A effect in the column difference. Similarly, the AC 
interaction effect is confounded with the B effect and the AB interaction effect 
is confounded with the C effect. 
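
The cancellation and confounding in the 2 X 2 square can be reproduced in a short sketch. The cell layout ($A_1B_1C_1$, $A_1B_2C_2$, $A_2B_1C_2$, $A_2B_2C_1$) and the sign pattern of each interaction are taken as assumptions consistent with the worked figures of this section; the helper names `sign` and `column_sums` are ours.

```python
# Cells of the 2 X 2 Latin square as (A, B, C) levels; columns are levels of A.
cells = [(1, 1, 1), (1, 2, 2), (2, 1, 2), (2, 2, 1)]

def sign(*levels):
    """+1 when an even number of the given levels equal 2, otherwise -1."""
    return (-1) ** sum(level == 2 for level in levels)

def column_sums(effect):
    """Sum an effect over the cells falling in each column (level of A)."""
    sums = {1: 0, 2: 0}
    for a, b, c in cells:
        sums[a] += effect(a, b, c)
    return sums

ac = column_sums(lambda a, b, c: 3 * sign(a, c))      # AC effects of +3 and -3
ab = column_sums(lambda a, b, c: -5 * sign(a, b))     # AB effects of -5 and +5
bc = column_sums(lambda a, b, c: 2 * sign(b, c))      # BC effects of +2 and -2
abc = column_sums(lambda a, b, c: 2 * sign(a, b, c))  # triple effects of +2 and -2

print(ac, ab, bc, abc)
```

Only the BC effects produce unequal column sums; the AC and AB effects cancel within each column, and the triple interaction contributes equally to both.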

In 3 X 3 (or larger) squares, the BC interaction is, in general, only partially
confounded with the A effect in the sum of squares for columns. That is, the
interaction effects are partly but not completely cancelled out. For example,
in a 3 X 3 Latin square design with columns $A_1$, $A_2$, $A_3$ and rows $B_1$, $B_2$, $B_3$,
if the BC interaction effects falling in the cells of the square are those shown
below, the column (A) means in the Latin square design are affected as follows:

              Col 1 ($A_1$)    Col 2 ($A_2$)    Col 3 ($A_3$)
$B_1$              6               −3               −3
$B_2$             −4               −1                5
$B_3$             −2                7               −5

           6 − 4 − 2 = 0    −3 − 1 + 7 = 3    −3 + 5 − 5 = −3


It is easy to show, similarly, that the AB and AC interactions do not affect 
differences among column means. Similarly, also, the AC interaction may be 
shown to be partially confounded with the row (B) effect, etc. 
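
Partial confounding in a 3 X 3 square can be illustrated numerically. The BC interaction effects and the square used below are hypothetical values, chosen so that every row and column of the BC table sums to zero and so that the column sums agree with those given in the text (0, 3, and −3); in particular, the labelling of the C levels is an assumption of this sketch.

```python
# Hypothetical BC interaction effects, indexed [B level][C level]; every row
# and every column sums to zero. The labelling of C levels is an assumption.
bc = {1: {1: 6, 2: -3, 3: -3},
      2: {1: -1, 2: 5, 3: -4},
      3: {1: -5, 2: -2, 3: 7}}

# Assumed 3 X 3 Latin square: square[B][A] gives the C level in that cell.
square = {1: {1: 1, 2: 2, 3: 3},
          2: {1: 3, 2: 1, 3: 2},
          3: {1: 2, 2: 3, 3: 1}}

# Sum of the BC effects falling in each column (level of A).
column_sums = {a: sum(bc[b][square[b][a]] for b in (1, 2, 3)) for a in (1, 2, 3)}
print(column_sums)
```

The column sums are unequal but small relative to the individual effects, which range from −5 to 7; this is the sense in which the interaction is only partly cancelled.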

In general, in 3 X 3 or larger squares, each main effect is partially con- 
founded with the interaction of the other two factors and with the triple inter- 
action. However, the larger the square, the more completely will these inter- 
actions tend to cancel out in the comparisons for main effects. This is why the 
statement was made earlier that when a is large the sum of squares for any 
main effect will tend to be the same in the incomplete (Latin square) as in the 
complete design. 

Because of this confounding of interactions with main effects, and because 
of the ambiguous character of the “residual” which must often be employed 
as the "error" term in testing main effects, experimental designs based on a 
single "simple" Latin square will perhaps seldom be useful in educational and 
psychological research. In general, they may be safely used only when the
intrinsic interactions may be assumed to be negligible and the Type G errors
have been completely randomized with reference to cells (in this case $ms_{res}$ is
the appropriate error term) or when both intrinsic and extrinsic interactions
may be assumed negligible (in which case $ms_w$ or $ms_{res}$, or both pooled, is an
appropriate error term). More complex Latin square designs, however,
involving the administration of several treatments or treatment-combinations to
the same subject, should prove quite useful. These are considered in Chapter 
13 following. 


Graeco-Latin Squares 


A Graeco-Latin square design is one involving four factors or treatment 
classifications with the same number of levels of each, in which each treatment 
in any classification is combined once and only once with each treatment in 
each other classification. This design may be pictured below with rows
corresponding to one treatment classification, columns to another, Latin letters to
another, and Greek letters to the last.


In the “simple” Graeco-Latin square design (a different random sample in 
each cell), each main effect is partially confounded with the first-order inter- 
actions of the remaining factors and with all higher order interactions. This 
design is therefore as limited in usefulness as is the simple Latin square design. 

The Graeco-Latin square will be used in certain of the designs in the follow- 
ing chapter. A Graeco-Latin square consists of two superimposed Latin 
squares, one formed with Greek and the other with Latin letters, such that the 
same Latin letter is never paired more than once with the same Greek letter. 
Latin squares that may thus be superimposed to form Graeco-Latin squares 
are called “orthogonal” Latin squares. Complete sets of orthogonal 3 X 3,
4 X 4, and 5 X 5 Latin squares are given below. By superimposing three or
more orthogonal Latin squares, “hyper-Graeco-Latin” squares may be formed.
No orthogonal 6 X 6 Latin squares exist. Sets of orthogonal Latin squares for
larger squares may be found in R. A. Fisher and F. Yates, Statistical Tables for
Biological, Agricultural and Medical Research, Third Edition (London: Oliver
and Boyd, 1948), pp. 62-63, and examples concerning their use on pages 15-18.


TABLE 9
Complete Sets of Orthogonal Latin Squares

        3 X 3                       4 X 4
     I       II              I        II       III
    123     123            1234     1234     1234
    231     312            2143     3412     4321
    312     231            3412     4321     2143
                           4321     2143     3412

                      5 X 5
      I         II        III        IV
    12345     12345     12345     12345
    23451     34512     45123     51234
    34512     51234     23451     45123
    45123     23451     51234     34512
    51234     45123     34512     23451
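
The orthogonality property defined above can be checked mechanically. The following is a minimal sketch in Python, not part of the original text: it transcribes the squares of Table 9 and confirms that each is a Latin square and that every pair within a set, when superimposed, yields each ordered pair of symbols exactly once. The function names (`is_latin`, `are_orthogonal`, `is_complete_orthogonal_set`) are illustrative assumptions.

```python
from itertools import combinations


def parse(rows):
    """Turn rows such as "2143" into lists of integers."""
    return [[int(ch) for ch in row] for row in rows]


def is_latin(square):
    """Each symbol occurs exactly once in every row and every column."""
    n = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({row[j] for row in square} == symbols for j in range(n))
    return len(symbols) == n and rows_ok and cols_ok


def are_orthogonal(sq1, sq2):
    """Superimposed, no pair of symbols may occur more than once."""
    n = len(sq1)
    pairs = {(sq1[i][j], sq2[i][j]) for i in range(n) for j in range(n)}
    return len(pairs) == n * n


def is_complete_orthogonal_set(squares):
    """All squares are Latin and every two of them are orthogonal."""
    return all(is_latin(s) for s in squares) and all(
        are_orthogonal(p, q) for p, q in combinations(squares, 2)
    )


# The squares of Table 9, transcribed row by row.
TABLE_9 = {
    3: [parse(["123", "231", "312"]),
        parse(["123", "312", "231"])],
    4: [parse(["1234", "2143", "3412", "4321"]),
        parse(["1234", "3412", "4321", "2143"]),
        parse(["1234", "4321", "2143", "3412"])],
    5: [parse(["12345", "23451", "34512", "45123", "51234"]),
        parse(["12345", "34512", "51234", "23451", "45123"]),
        parse(["12345", "45123", "23451", "51234", "34512"]),
        parse(["12345", "51234", "45123", "34512", "23451"])],
}

if __name__ == "__main__":
    for size, squares in TABLE_9.items():
        print(size, is_complete_orthogonal_set(squares))
```

Superimposing any two squares of the same size from this table gives a Graeco-Latin square; superimposing three or more gives a hyper-Graeco-Latin square.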


Controlling Individual Differences in Factorial 
Experiments Through the Use of 
"Mixed" Designs 


Introduction 


Differences among individuals or subjects are a major source of variation in 
psychological experiments. Usually this source of error (Type S errors) is 
very much more potent than any other. In general, the most precise and 
efficient experimental designs are therefore those in which the effects of in- 
dividual differences are held constant or counterbalanced, rather than merely 
randomized. 

If the number of treatments is small and it is both possible and feasible to 
administer all treatments to each of the subjects, the effect of individual 
differences can be completely equalized in all comparisons. This is done by 
using the treatments X subjects design, in which every subject takes all of the 
treatments. The possibilities of thus completely controlling individual differ- 
ences, however, are extremely limited. In many factorial experiments the 
number of treatment-combinations is so large that it is not practicable to 
administer all of them to each subject. In other instances, one (or more) of the 
treatment classifications may be such that more than one of the treatments in 
the same classification cannot possibly be satisfactorily administered to the 
same subject. This is because the subject is so changed by the administration 
of the first treatment that the results of the later administration of another 
treatment are not comparable or meaningful for purposes of evaluation. For 
example, having learned how to perform a certain task under certain condi- 
tions, the same subject cannot "learn" to do the same task again under a 
second set of conditions. In still other factorial experiments, the criterion 
measure employed may be such that it cannot meaningfully be secured more 
than once for the same subject. For example, the criterion measure may be 
the time required to read a certain passage. Obviously the time required to 
read the same passage a second time would not have the same meaning, nor
would the time required to read a different passage of possibly different diffi-
culty. Finally, in some factorial experiments one of the “factors” to be in- 
vestigated or controlled is the order in which the experimental treatments have 
been administered. Quite clearly, the same subject cannot take more than one 
treatment in the same rank order in the same series of treatments. 

In factorial experiments of the types just described it may nevertheless be 
possible to control individual differences in some but not all of the treatment 
comparisons. This may be done by administering some of the treatment- 
combinations to some of the subjects and other treatment-combinations to 
other subjects. That is, the experiment may be so designed that in compari- 
sons involving only one of the factors, the effects of the other factors are 
counter-balanced or equalized, and so that these comparisons are still unbiased. 

Experimental designs of the type just suggested, in which each subject takes
more than one but not all of the combinations, will be referred to in this
discussion as “mixed” designs. A “mixed” design may be defined as one in
which some of the treatment comparisons are inter-subject and some are
intra-subject comparisons. In the simple-randomized design of Chapter 3, and in
the simple factorial design of Chapter 9, all of the comparisons are
inter-subject comparisons. In contrast, in the A X S design of Chapter 6 and in
the A X B X S design of Chapter 10, all treatment comparisons are intra-subject
comparisons. “Mixed” designs may be regarded as mixtures of the
simple-randomized and the treatments X subjects designs. In designs of this
mixed type, of course, the inter-subject comparisons are usually much less
precise than the intra-subject comparisons. However, the experiment may
sometimes be designed so that individual differences are controlled in all of
the more important comparisons, and so that precision is sacrificed only in the
less important ones.

For convenience in the subsequent discussions, we will identify each mixed 
design by a Roman numeral, since any truly descriptive name would be too 
cumbersome. 


Type I Designs


The simplest type of mixed design is a two-factor (A X B) design in which
each of the A treatments in combination with any one B treatment is
administered to the same subjects, but with each B treatment administered to a
different group of subjects. For convenience in later reference, we will call
this a Type I design. An example of this type of design is diagrammed below for
the case in which a = 4 and b = 3. The subjects are divided at random into
three groups, not necessarily of the same size. The first group is given all
combinations of A with B1, that is, the first group is given
treatment-combinations A1B1, A2B1, A3B1, and A4B1. The second group is given
all combinations of A with B2, etc. The total experiment may thus be regarded
as consisting of three treatments X subjects experiments, one with B held
constant at the B1 level, another at the B2 level, etc.


                                              A1    A2    A3    A4

Type I                        B1 (Group 1)                           n1 subjects
Within effects: A and AB      B2 (Group 2)                           n2 subjects
Between effects: B            B3 (Group 3)                           n3 subjects

                                                                      N subjects


The total sum of squares in this table may be analyzed into two or more 
components on each of several different bases, as follows: 


(Disregarding both A and B)            ss_T = ss_S + ss_wS                        (91)
(Disregarding B only)                  ss_T = ss_A + ss_S + ss_AS                 (92)
(Disregarding A only, see page 175)    ss_T = ss_B + ss_SwB + ss_wS               (93)
(Disregarding S only)                  ss_T = ss_A + ss_B + ss_AB + ss_w cells    (94)

From (91) and (92) we get

        ss_wS = ss_A + ss_AS,                                                     (95)

and from (91) and (93)

        ss_S = ss_B + ss_SwB.                                                     (96)


Thus, as is otherwise evident, the A effect is a “within” subjects effect and
the B effect is a “between” subjects effect.

We may next note that if for each subject a constant correction were applied
to all of his measures so as to make their mean equal to the general mean, that
is, if the between-subjects sum of squares (ss_S) were eliminated by arithmetic
corrections, the sum of squares for A and for AB would remain unaffected.
Obviously, therefore, ss_AB cannot be contained in ss_S, and by (94), neither is
it contained in ss_A. Accordingly, by (95), ss_AB must be contained in ss_AS.
For reasons to be explained later, we will call the remainder of ss_AS the sum
of squares for “error within.” That is,

        ss_error(w) = ss_AS - ss_AB.                                              (97)

We will also let ss_SwB = ss_error(b), which we will call the “error between”
sum of squares. Then

        ss_wS = ss_A + ss_AB + ss_error(w).                                       (98)


From (96), (97), (98), and (91) we find that we have thus analyzed the total
sum of squares into five components, two of which are based upon
between-subjects comparisons and three on within-subjects comparisons, as
follows:

          “between” components        “within” components
  ss_T =  ss_B + ss_error(b)     +    ss_A + ss_AB + ss_error(w)


It is very important to note that, because of proportionality of cell fre- 
quencies, each of these components is unaffected by changes in the magnitude 
of any other. The student should satisfy himself of this by noting that arith- 
metic corrections applied to the data to eliminate any of these components 
will leave all other components unaffected. 
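
The five-component analysis can be sketched in code. The following Python is an illustrative sketch, not the author's computational procedure: it computes the sums of squares by the residual definitions in equations (91) to (98), then forms degrees of freedom, mean squares, and F ratios in the usual way. The data layout (a list of groups, one per B level, each a list of subjects, each subject a list of a scores) and all names and scores are assumptions made for illustration. The helper `as_interaction_ss` computes the treatments X subjects interaction directly within one group, as an independent check on the residual ss_error(w).

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def as_interaction_ss(group):
    """A X S interaction sum of squares computed directly within one group."""
    a = len(group[0])
    group_mean = mean(x for s in group for x in s)
    col_means = [mean(s[i] for s in group) for i in range(a)]
    return sum((s[i] - mean(s) - col_means[i] + group_mean) ** 2
               for s in group for i in range(a))


def type1_analysis(groups):
    """groups[j] holds the subjects given B_(j+1); each subject is a list of
    a scores, one per A treatment. Group sizes may differ."""
    a, b = len(groups[0][0]), len(groups)
    subjects = [s for g in groups for s in g]
    N = len(subjects)
    grand = mean(x for s in subjects for x in s)

    ss_T = sum((x - grand) ** 2 for s in subjects for x in s)
    ss_S = a * sum((mean(s) - grand) ** 2 for s in subjects)
    ss_B = sum(a * len(g) * (mean(x for s in g for x in s) - grand) ** 2
               for g in groups)
    ss_A = N * sum((mean(s[i] for s in subjects) - grand) ** 2
                   for i in range(a))
    ss_cells = sum(len(g) * (mean(s[i] for s in g) - grand) ** 2
                   for g in groups for i in range(a))
    ss_AB = ss_cells - ss_A - ss_B
    ss_wS = ss_T - ss_S

    ss = {"B": ss_B, "error_b": ss_S - ss_B, "A": ss_A, "AB": ss_AB,
          "error_w": ss_wS - ss_A - ss_AB, "T": ss_T}
    df = {"B": b - 1, "error_b": N - b, "A": a - 1,
          "AB": (a - 1) * (b - 1), "error_w": (a - 1) * (N - b)}
    ms = {k: ss[k] / df[k] for k in df}
    F = {"B": ms["B"] / ms["error_b"],
         "A": ms["A"] / ms["error_w"],
         "AB": ms["AB"] / ms["error_w"]}
    return ss, df, ms, F


# Hypothetical scores: a = 4, b = 3, group sizes 3, 2, and 3.
DATA = [
    [[3, 5, 6, 8], [2, 4, 7, 7], [4, 4, 5, 9]],
    [[6, 7, 9, 12], [5, 8, 8, 10]],
    [[1, 2, 4, 4], [2, 2, 5, 6], [3, 4, 4, 7]],
]

if __name__ == "__main__":
    ss, df, ms, F = type1_analysis(DATA)
    for source in ("B", "error_b", "A", "AB", "error_w"):
        print(source, df[source], round(ss[source], 4))
    print({k: round(v, 3) for k, v in F.items()})
```

Summing `as_interaction_ss` over the groups should reproduce the residual ss_error(w), which is one way for the student to satisfy himself that the components are separable.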


The degrees of freedom for ss_B, ss_A, and ss_AB are (b - 1), (a - 1), and
(a - 1)(b - 1), respectively; a representing the number of A treatments and b
the number of B treatments. The degrees of freedom for ss_error(b) is obtained
by subtracting the degrees of freedom for B from the degrees of freedom for S.
The degrees of freedom for ss_error(w) is similarly obtained as a residual.

It may be helpful to regard ss_error(b) as the sum of squares for
“between-subjects-within-groups” and to note that ss_error(b) could be obtained
by computing the sum of squares for between-subjects for each group separately
and then summing these sums of squares for all groups. That is,

        ss_error(b) = Σ_j ss_S(Bj).

For a single group, the degrees of freedom for ss_S(Bj) = n_j - 1, and hence
the degrees of freedom for ss_error(b) is Σ_j (n_j - 1) = N - b.


We may note also that the Type I design consists essentially of a number of
treatments X subjects (A X S) experiments, one for each level of B. For any
one level of B the A effect is tested by F = ms_A(Bj)/ms_AS(Bj), in which
ms_AS(Bj) = ss_AS(Bj)/(a - 1)(n_j - 1). The sum of squares for error (w) may be
regarded as the sum of the sums of squares for AS computed for the various
levels of B, that is,

        ss_error(w) = Σ_j ss_AS(Bj)

with

        df_error(w) = Σ_j (a - 1)(n_j - 1) = (a - 1)(N - b).


The analysis in the general case may be summarized as in Table 10. In this
table, N refers to the total number of subjects.


TABLE 10
Analysis of Type I Designs

Source              df                Sums of Squares                         Mean Squares

Between-Subjects    N - 1             ss_S
  B                 b - 1             ss_B                                    ms_B
  error (b)         N - b             ss_error(b) = ss_S - ss_B               ms_error(b)

Within-Subjects     N(a - 1)          ss_wS = ss_T - ss_S
  A                 a - 1             ss_A                                    ms_A
  AB                (a - 1)(b - 1)    ss_AB = ss_cells - ss_A - ss_B          ms_AB
  error (w)         (a - 1)(N - b)    ss_error(w) = ss_wS - ss_A - ss_AB      ms_error(w)

Total               aN - 1            ss_T


Test of the B Effect: The significance of the B effect is tested by F =
ms_B/ms_error(b). To prove that this mean square ratio is distributed as F,
let us suppose that within each row of the table (page 268) constant
corrections have been applied to the measures within each cell, so as to make
the cell mean equal to the row mean. It should be clear that these corrections
would eliminate both the A effect and the AB interaction effect, or would
make both ms_A and ms_AB equal to zero. The other mean squares, however,
would remain unaffected. That is, ms_B in the corrected table would be the
same as ms_B in the original table, and likewise ms_SwB in the corrected table
would be the same as ms_SwB = ms_error(b) in the original table. So far as the
corrected measures are concerned, the design would then be a
groups-within-treatments design, the a observations for each subject
constituting a “group,” so that S would take the place of G and B that of A in
the analysis on page 177. In this groups-within-treatments design, since the
number of observations (a) is the same for each “group” (subject), the
treatments (B) effect is tested by F = ms_B/ms_SwB = ms_B/ms_error(b). The
degrees of freedom for ms_SwB = (N - b), which is the same as that for
ms_error(b). The assumptions underlying this test are that, for the population
as a whole, the subject means are normally distributed with the same variance
for each B treatment (see pages 178-179), and that the subjects involved in the
experiment are a simple random sample from the population concerning which
inferences are to be drawn. The hypothesis tested is that for this population
the mean of the subject means is the same for each treatment. The error
variance of the difference between two individual B means, say for B1 and B2,
is

        ms_error(b) [1/(a n_1) + 1/(a n_2)].


Test of the A Effect: The significance of the A effect is tested by F =
ms_A/ms_error(w). To prove that this ratio is distributed as F, let us again
suppose, in the manner of preceding proofs, that constant corrections have been
applied to the measures in each row of the table (page 268) so as to make each
row mean equal to the general mean. Let us suppose that within each column
of the table subsequent corrections have also been applied to the measures in
each cell, so as to make each cell mean equal to the column mean. These
corrections would eliminate both the B effect and the AB interaction effect,
rendering ss_B = 0 and ss_AB = 0. It follows from (97) that ss_AS = ss_AB +
ss_error(w) = ss_error(w), so that ms_AS = ms_error(w). These corrections,
however, would leave the other mean squares unchanged; that is, ms_A,
ms_error(b), and ms_error(w) would be the same in the corrected table as in the
original table. In the table of (twice) corrected measures, other-than-chance
differences between rows have been eliminated, so that the entire table may be
regarded as corresponding to a simple treatments X subjects (A X S) design,
with a treatments and s = N subjects. With this A X S design, the appropriate
test of the treatments (A) effect is made by F = ms_A/ms_AS = ms_A/ms_error(w).
The assumptions underlying this test are that in the corrected table the AS
interaction effects are normally and independently distributed with the same
variance for each of the A treatments. This is equivalent to assuming for the
original table that the AS interaction effects in the population are normally
and independently distributed for each A treatment at each level of B, and that
the variances of these distributions are the same for all levels of B. The
error variance of the difference between the means for A1 and A2 is
2 ms_error(w)/N.

Test of the AB Interaction: The significance of the AB interaction is tested
by F = ms_AB/ms_error(w). The proof that this ratio is distributed as F is
similar to the preceding. We will first suppose that constant corrections have
been applied to the a measures for each subject so as to make the mean for each
subject equal to the general mean. These corrections would of course eliminate
both of the between-subjects effects; that is, they would render ms_B = 0
and ms_error(b) = 0. Let us suppose that other constant corrections have
subsequently been applied to the measures in each column so as to make each
column mean equal to the general mean, thus eliminating ms_A. From Table 10,
it is apparent that this would leave AB and error (w) as the only remaining
sources of variation in the corrected table. On the assumption that the AS
interaction effects are normally and independently distributed for each
treatment at each level of B, with the same variance throughout, the measures in
the corrected table may then be regarded as derived from a simple-randomized
design, in which the various A-B combinations represent the “treatments.”
In this design, the “treatments” effect would be the AB interaction, and
“within treatments” would be the same as error (w). The “treatments”
(AB) effect would then be tested by the F-ratio of the mean square for
“treatments” to that for “within treatments,” which in this case is the same as
F = ms_AB/ms_error(w).

Tests of Simple Effects: The simple effects of A are “within” effects, and 
should be tested against error (w). This is on the assumption that error (w) is 
homogeneous for all levels of B. If this assumption is questionable, the A 
effects for any given level of B may be tested against the AS interaction mean 
square computed for that level of B only, as in a simple A X S design. For 
computing the significance of the difference between two A means at any given 
level of B, the appropriate error term is also MSerror (w) [see (36), page 165]. 

The simple effects of B in this design cannot be tested as is the main effect 
of B. So far as any one A-level alone is concerned, the design is a simple- 
randomized design. Accordingly, an appropriate error term for testing the B 
effect at any given level of A is the mean square for within-treatments (within- 
cells) computed for that level of A only. This error term may be used in an 
over-all F-test to test the (simple) effect of B, and also in tests of
differences between individual B-means at the given level of A. The only
objection to this error term is the loss of degrees of freedom involved by
using the data from a part of the table only. If the number of degrees of
freedom for this error term [Σ_j (n_j - 1)] is small, say, less than 20, it may
be desirable to use a more stable error term based on all of the experimental
data. Assuming that the mean square for “within-cells” is homogeneous for all
levels of A, this more stable error term may be secured by averaging these mean
squares for all levels of A. This average is the same as the mean square for
within-cells
for the entire table. The sum of squares for within-cells for the whole table is 
the residual left when the sums of squares for A, B, and AB are subtracted 
from the sum of squares for total. It is evident from Table 10 that 


        ss_w cells = ss_error(b) + ss_error(w).

Accordingly, for the case in which the number of cases is constant for all B
categories, the mean square for within-cells for the entire table is

        ms_w cells = ss_w cells / ab(n - 1)
                   = [ss_error(b) + ss_error(w)] / ab(n - 1)
                   = [ms_error(b) + (a - 1) ms_error(w)] / a.

It does not follow from this, however, that the simple effect of B for any
given level (j) of A may be tested by means of F = ms_B(j)/ms_w cells with
(b - 1) and ab(n - 1) degrees of freedom. The sum of squares for within-cells
for a single level of A is a χ²-distributed variable with b(n - 1) degrees of
freedom, but the sum of these sums of squares for the a levels is not
distributed as χ² with ab(n - 1) degrees of freedom, since these sums of
squares are not independent, the same subjects being involved in all levels of
A. Nevertheless, according to W. G. Cochran, an approximate t-test for testing
the significance of the difference between two B means for the same level of A
may be obtained by employing the procedure described on page 98. For example,
suppose we wish to test the difference between the means for B1 and B2 at the
second level of A by

        t = (M_B1 - M_B2) / sqrt(2 ms_w cells / n).

If we let t_b and t_w represent the values of t which are significant at the
selected level for the degrees of freedom corresponding to ms_error(b) and
ms_error(w), respectively, the value of t which is significant at the selected
level may be taken as

        t' = [ms_error(b) t_b + (a - 1) ms_error(w) t_w]
             / [ms_error(b) + (a - 1) ms_error(w)].

From the expression for t' it is, of course, possible to compute the “critical
difference” which may be applied to any simple B-comparison in the entire
table (granting that n is constant).
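
The weighted critical value and the resulting critical difference involve only arithmetic on quantities already obtained. The following Python sketch is illustrative only: the function names are assumptions, the tabled values t_b and t_w must be looked up by the user for the degrees of freedom N - b and (a - 1)(N - b), and the numbers in the example are hypothetical rather than taken from any experiment.

```python
import math


def ms_within_cells(ms_error_b, ms_error_w, a):
    """Pooled within-cells mean square when n is constant for all groups."""
    return (ms_error_b + (a - 1) * ms_error_w) / a


def weighted_critical_t(ms_error_b, ms_error_w, a, t_b, t_w):
    """Average of the two tabled t values, weighted by ms_error(b) and
    (a - 1) * ms_error(w), following the expression in the text."""
    weight_b = ms_error_b
    weight_w = (a - 1) * ms_error_w
    return (weight_b * t_b + weight_w * t_w) / (weight_b + weight_w)


def critical_difference(ms_error_b, ms_error_w, a, n, t_b, t_w):
    """Smallest difference between two B means at one level of A that is
    significant at the level for which t_b and t_w were read."""
    standard_error = math.sqrt(2 * ms_within_cells(ms_error_b, ms_error_w, a) / n)
    return weighted_critical_t(ms_error_b, ms_error_w, a, t_b, t_w) * standard_error


if __name__ == "__main__":
    # Hypothetical values: a = 4, n = 7, with t_b and t_w supplied by the
    # user from a t table for the appropriate degrees of freedom.
    print(critical_difference(ms_error_b=12.0, ms_error_w=3.0, a=4, n=7,
                              t_b=2.10, t_w=2.00))
```

Because the result is a weighted average, it always lies between the two tabled values, and lies nearer t_b the more ms_error(b) exceeds ms_error(w).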

It should be noted that this procedure has the disadvantage that it does not 
permit an over-all F-test of any simple effect of B. Unless the number of de- 
grees of freedom is too small, therefore, it may be better as well as simpler to 
use the error term computed from the data for the given level of A only. 

For establishing a confidence interval for a single B mean for all levels of A
(independently of other B means) the appropriate error term is est'd σ²_M(B) =
ms_error(b)/an, with N - b degrees of freedom. For a single A mean over all
levels of B the error term is est'd σ²_M(A) = ms_w cells/nb, and the critical
value of t is computed in the manner shown above. For a single cell mean, the
error term is either est'd σ²_M(AB) = ms_w cells/n (the critical value of t
being found in the manner already explained), or the mean square for “within
cells” computed only for the particular level of A involved.

An Important Use of the Type I Design: Among the most important applica- 
tions of the Type I design are those in which each of a number of groups of 
subjects is given a different “training series," or is trained in a certain function 
under a different set of conditions. Observations of the function under train- 
ing (criterion measures) are taken at regular or stated intervals during the 
learning or training series, these intervals being the same for all series. In the 
Type I design, these intervals correspond to the A categories and the different 
training series correspond to the B categories. The object of the experiment 
may be to determine whether or not there are any characteristic differences in 
the “learning curve” or in the trend of the A means for the various levels of B. 
Other purposes may be to determine whether any given training series has any 
effect on the function involved, or whether the different series differ in their 
final effect. In many experiments, each series is an “extinction” series rather 
than a “training” series — the object being to eliminate the effects of previous 
training series. How the various tests of significance just considered may be 
interpreted in relation to these purposes will be explained in Chapter 15. 


Type II Designs


If the number of treatments (a) is the same for both treatment classifica- 
tions, it is possible to control individual differences in evaluating the main 
effects of both treatments (A and B), without having to administer more than 
a of the a² treatment-combinations to any one subject.

The subjects are divided at random into a groups; each group takes a treat- 
ment combinations, but no group or individual takes more than one treatment 
in any A category or more than one treatment in any B category. To draw up 
the design, one selects any a X a Latin square, letting the columns correspond 
to the various A categories and the rows to the groups. The B factor is then 
the Latin square factor. This has been done for the case in which a equals 4 
in the left-hand diagram below. It is then readily apparent which treatment 
combinations are to be administered to each group. The design to the right 
represents an alternate way of representing exactly the same design. 


[Diagram: Type II design for a = b = 4, drawn in two equivalent layouts, each
with columns A1, A2, A3, A4 and one row per group (G1 to G4).
Within effects: A and B. Mixed effect: AB.]
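
The construction described above, with groups as rows, A categories as columns, and B as the Latin square factor, can be sketched as follows. This Python is illustrative only. The text allows any a X a Latin square; the cyclic square used here is merely a convenient assumption and is not necessarily the square shown in the original diagram. The function names are likewise hypothetical.

```python
def cyclic_latin_square(a):
    """One convenient a X a Latin square: each row is the previous row
    shifted one place. Entries are B levels numbered 1..a."""
    return [[(g + j) % a + 1 for j in range(a)] for g in range(a)]


def type2_assignments(square):
    """Map each group (row) to its treatment-combinations.

    Column j of the square is A level j + 1, and the entry in the cell is
    the B level combined with that A level for the group in that row."""
    return {
        g + 1: [(j + 1, b_level) for j, b_level in enumerate(row)]
        for g, row in enumerate(square)
    }


if __name__ == "__main__":
    plan = type2_assignments(cyclic_latin_square(4))
    for group, combos in plan.items():
        listing = ", ".join(f"A{a}B{b}" for a, b in combos)
        print(f"Group {group}: {listing}")
```

Each group receives a of the a² combinations, meets every A level and every B level exactly once, and no combination is given to more than one group.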




If there are only two levels of each factor (a = 2), the AB interaction is 
completely confounded with the G factor (group differences) as was pointed 
out on page 262, and is therefore a “between” effect (see page 268). If there 
are three or more levels per factor (a > 2), the AB interaction is a "mixed" 
effect, being based partly on intra-subject and partly on inter-subject compari- 
sons. Since the intra-subject differences are usually less variable than the 
inter-subject differences, the AB interaction is usually a heterogeneous
interaction.

The AB interaction is heterogeneous in this case in a somewhat different 
sense than in the instances previously considered. We have noted (page 34) 
that the sum of the squared deviations of a number of measures from their 
mean depends upon the sum of the squares of all possible differences among the 
measures taken two at a time. Accordingly, the sum of squares for between- 
cells in a double-entry table depends on the differences for all possible pairs of 
cells in the table. Some of these differences are differences between cells in the 
same row but in different columns; some are differences between cells in the 
same column but in different rows. The rest of the differences are based upon 
“cross comparisons," that is, upon comparisons between cells that are neither 
in the same row nor the same column. The differences between cells in the
same row but different columns account for the sum of squares for
between-columns, which in the case of the Type II design is ss_A. The
differences between cells in the same column but different rows account for the
sum of squares for between-rows, which in this case is ss_B. We know that
ss_cells = ss_A + ss_B + ss_AB. It is apparent, therefore, that the remaining
differences, that is, the differences based upon cross comparisons, account for
the interaction sum of squares (ss_AB). It is apparent from the diagram above,
however, that some of the cross comparisons are based upon the same group and
hence are intra-subject comparisons, while some are based upon different groups
and hence are between-subject comparisons. The between-subject cross
comparisons account for one component of ss_AB, which we will call ss_AB(b).
The within-subject cross comparisons account for the remaining component, which
we will call ss_AB(w). The between-subject differences will of course be more
variable than the within-subject differences, and ms_AB(b) will be larger than
ms_AB(w). It is in this sense that the AB interaction is heterogeneous in the
Type II design.

The meaning of AB(b) and AB(w) may be further clarified by viewing the
AB interaction mean square in still another light. We may note first that, in
general in any square table the sum of squares for “between-cells,” with
(a² - 1) degrees of freedom, can be analyzed into (a + 1) independent
components, each of which is the sum of squares for “between-sets” in an
arrangement of a sets of a cells each. One such arrangement is that in which
each set consists of the cells in a single column of the table. The sum of
squares for “between-sets” in this case is what we have called ss_C, or ss_A if
the A categories correspond to columns. Another such arrangement is that in
which each set consists of the cells in a given row of the table, the sum of
squares for between sets in this case being ss_R, or ss_B. The sums of squares
for between sets in the remaining arrangements are components of the sums of
squares for interaction.

Consider a 2 X 2 table, in which the cells are numbered as follows:

        1 | 2
        3 | 4

The possible arrangements of two sets of two cells each, with the
corresponding sums of squares and degrees of freedom, are as follows:

Arrangement         Sets                       ss          df
A                   (1 + 3) and (2 + 4)        ss_A         1
B                   (1 + 2) and (3 + 4)        ss_B         1
AB                  (1 + 4) and (2 + 3)        ss_AB        1
between cells                                  ss_cells     3

The sum of squares for between sets in the last arrangement is what we
have known as ss_AB. In the 2 X 2 table, then, ss_AB is based on the difference
in the means for 2 sets or pairs of cells, one of which consists of cells 1 and
4 and the other of cells 2 and 3, as numbered above.

In the 3 X 3 table

        1 | 2 | 3
        4 | 5 | 6
        7 | 8 | 9

the possible independent arrangements are

Arrangement         Sets                                              ss          df
A                   (1 + 4 + 7), (2 + 5 + 8) and (3 + 6 + 9)          ss_A         2
B                   (1 + 2 + 3), (4 + 5 + 6) and (7 + 8 + 9)          ss_B         2
AB1                 (1 + 5 + 9), (2 + 6 + 7) and (3 + 4 + 8)          ss_AB1       2
AB2                 (3 + 5 + 7), (1 + 6 + 8) and (2 + 4 + 9)          ss_AB2       2
between cells                                                         ss_cells     8


In this case, one component of ss_AB is based on the comparisons among three
sets, one of which consists of cells 1, 5, and 9; another of 2, 6, and 7; and
the third of cells 3, 4, and 8. This component is denoted ss_AB1 in the table
above. If the design is a Type II design in which cells 1, 5, and 9 are
assigned to Group 1, cells 2, 6, and 7 to Group 2, and cells 3, 4, and 8 to
Group 3, the between-sets comparisons are clearly “between-subjects”
comparisons, since different subjects constitute each set. In this case, ss_AB1
may also be written ss_AB(b). The between-sets comparisons in the last
arrangement (AB2) above are clearly “within-subjects” comparisons, since each
of the sets in this arrangement consists of the same subjects. Consequently,
the last component may in this case be written ss_AB(w).

In the 4 X 4 table

         1 |  2 |  3 |  4
         5 |  6 |  7 |  8
         9 | 10 | 11 | 12
        13 | 14 | 15 | 16

five possible arrangements of four sets of four cells each are as follows:

Arrangement         Sets                                                ss          df
A                   (1 + 5 + 9 + 13), (2 + 6 + 10 + 14),                ss_A         3
                    (3 + 7 + 11 + 15) and (4 + 8 + 12 + 16)
B                   (1 + 2 + 3 + 4), (5 + 6 + 7 + 8),                   ss_B         3
                    (9 + 10 + 11 + 12) and (13 + 14 + 15 + 16)
AB1                 (1 + 6 + 11 + 16), (2 + 5 + 12 + 15),               ss_AB1       3
                    (3 + 8 + 9 + 14) and (4 + 7 + 10 + 13)
AB2                 (1 + 7 + 12 + 14), (2 + 8 + 11 + 13),               ss_AB2       3
                    (3 + 5 + 10 + 16) and (4 + 6 + 9 + 15)
AB3                 (1 + 8 + 10 + 15), (2 + 7 + 9 + 16),                ss_AB3       3
                    (3 + 6 + 12 + 13) and (4 + 5 + 11 + 14)
between cells                                                           ss_cells    15


If this design is a Type II design and if cells 1, 6, 11, and 16 are assigned to
Group 1; cells 2, 5, 12, and 15 to Group 2; cells 3, 8, 9, and 14 to Group 3;
and cells 4, 7, 10, and 13 to Group 4, the sum of squares for between sets in
this arrangement corresponds to ss_AB1 = ss_AB(b), while ss_AB(w) = ss_AB2 +
ss_AB3, with 6 degrees of freedom. It should be apparent to the student from an
inspection of the preceding diagram and table that the between-sets
comparisons of any one of the five arrangements are independent of those in any
other, since each set in any one arrangement is represented in every one of the
sets in each of the other arrangements.
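
The independence of the five arrangements in the 4 X 4 table can be checked directly. The Python below is an illustrative sketch with assumed function names. It confirms that every set in one arrangement shares exactly one cell with every set in each other arrangement, and that, for arbitrary cell values, the five between-sets sums of squares add up to the between-cells sum of squares. The cell values are made-up numbers chosen only to exercise the arithmetic.

```python
from itertools import combinations

# The five arrangements of the 4 X 4 table, cells numbered 1..16 by rows.
ARRANGEMENTS = {
    "A":   [(1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15), (4, 8, 12, 16)],
    "B":   [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12), (13, 14, 15, 16)],
    "AB1": [(1, 6, 11, 16), (2, 5, 12, 15), (3, 8, 9, 14), (4, 7, 10, 13)],
    "AB2": [(1, 7, 12, 14), (2, 8, 11, 13), (3, 5, 10, 16), (4, 6, 9, 15)],
    "AB3": [(1, 8, 10, 15), (2, 7, 9, 16), (3, 6, 12, 13), (4, 5, 11, 14)],
}


def mutually_balanced(arrangements):
    """True if each set of one arrangement meets each set of every other
    arrangement in exactly one cell."""
    for sets_p, sets_q in combinations(arrangements.values(), 2):
        for s in sets_p:
            for t in sets_q:
                if len(set(s) & set(t)) != 1:
                    return False
    return True


def between_sets_ss(cell_values, sets):
    """Sum of squares between the sets of one arrangement (one value per cell)."""
    grand = sum(cell_values.values()) / len(cell_values)
    total = 0.0
    for s in sets:
        set_mean = sum(cell_values[c] for c in s) / len(s)
        total += len(s) * (set_mean - grand) ** 2
    return total


def between_cells_ss(cell_values):
    grand = sum(cell_values.values()) / len(cell_values)
    return sum((v - grand) ** 2 for v in cell_values.values())


# Arbitrary illustrative cell means.
CELL_VALUES = {c: float((7 * c * c) % 11) for c in range(1, 17)}

if __name__ == "__main__":
    print(mutually_balanced(ARRANGEMENTS))
    parts = {name: between_sets_ss(CELL_VALUES, sets)
             for name, sets in ARRANGEMENTS.items()}
    print(parts, between_cells_ss(CELL_VALUES))
```

With the group assignment given above, the component for AB1 is the between-subjects part of the interaction, and the components for AB2 and AB3 together form the within-subjects part.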

Importance of the Type II Design: The Type II design finds many important 
applications in psychological research. The most obvious application is that 
already suggested, in which A and B represent two experimental factors or 
treatment classifications. A less obvious but quite important application is 
that in which the design is used to counterbalance the effects of the order in 
which the treatments are administered to the subjects in what would otherwise
be a simple treatments X subjects design. If there are four treatments,
for example, the subjects are divided at random into four groups of the same 
size. Each group takes the four treatments in a different order, the orders
being so arranged that each treatment is taken first by one group, second by
another, third by another, and fourth by the remaining group. The diagram on
page 273 again describes the design, if we let B1, B2, etc., each represent the
rank order in which a treatment is administered. If the order in which a
treatment is administered has any
effect on the criterion measures, this effect is rendered the same for all treat- 
ments, since each treatment is administered in each possible rank order. This 
design would make possible not only a study of the effects of order upon the 
main effects of treatments, but also of the interaction of order and treatments. 

Another important application is that in which the design is used to counter- 
balance the effect of differences in the tests or other devices employed to secure 
the criterion measures, in what would otherwise be a simple treatments X sub- 
jects design. We will henceforth refer to this source of error as the “criterion” 
factor. Suppose, for example, that it is desired to control individual differ- 
ences in an experiment concerned with the effect of size of type upon reading 
rate. Obviously, the purposes of this experiment could not be served by 
having the same subjects read the same passage four times in four different 
sizes of type. If individual differences are to be controlled, a different passage 
must be used with each size of type. It may not be practicable to use passages 
that have been experimentally equated for difficulty. Instead, the passages 
may be selected so that, in the subjective opinion of qualified observers, they 
are nearly or approximately of the same difficulty. What differences in
difficulty do exist among the passages, however, can be counterbalanced in the
experiment by using the Latin square design that has just been described.
Each passage would be printed in all four sizes of type, so that there would be
16 tests in all. In the diagram on page 273, the four different passages would
correspond to B1, B2, B3, and B4, and A would represent the treatment (size of
type) factor. Group 1 would read all four passages, each in a different size of 
type, as would Groups 2, 3, and 4, but the combinations of size of type and 
passage would differ for each group. Thus each treatment would be adminis- 
tered to each group, and each passage would be used an equal number of times 
with each treatment and with each group, but no passage would be used more 
than once with any treatment or any group. Any differences among passages, 
as well as any differences among groups, would then be completely counter- 
balanced so far as their effect on treatment means is concerned. (To avoid any 
possible bias due to the order in which the passages are read by the same sub- 
ject, the order could be randomized for each subject independently. Another 
possibility would be to counterbalance both order and passages, using one of 
the designs to be considered later.) 

Analysis of Total Sum of Squares: The total sum of squares in the Type II 
design may be analyzed (see diagram on page 273) into independent compo- 
nents in a number of different ways as follows: 



ss_T = ss_S + ss_wS                                   (99)
ss_T = ss_A + ss_B + ss_AB + ss_wc                    (100)
ss_T = ss_G + ss_cwG + ss_wc                          (101)


in which ss_wc represents the sum of squares for within-cells, and ss_cwG
represents the sum of squares for “between-cells-within-groups.”

It is evident that if differences among subjects were eliminated by arithmetic
corrections, the mean squares for A and B would remain unaffected. It
is clear, therefore, that the sums of squares for A and B are both a part of ss_wS.
Elimination of subject differences, however, would also mean elimination of
group differences (G). Hence, ss_G is clearly a part of ss_S.

From (100) and (101) it is evident that


ss_A + ss_B + ss_AB = ss_G + ss_cwG.


Now since both ss_A and ss_B are “within” components, and ss_G is a
“between” component which cannot be contained in either ss_A or ss_B, it follows
that ss_G must be contained in ss_AB, or that ss_AB is partly a “between” subjects
effect. As previously noted, ss_AB(b) = ss_G represents the “between” subjects
component of ss_AB, and ss_AB(w) represents the “within” component. That is,

ss_AB(w) = ss_AB - ss_G.


We will finally let ss_error(b) represent the residual obtained when ss_G is
subtracted from ss_S, and ss_error(w) represent the residual obtained when ss_A,
ss_B, and ss_AB(w) are subtracted from ss_wS; that is,


ss_error(w) = ss_wS - ss_A - ss_B - ss_AB(w).


The entire analysis is presented in Table 11, n representing the number of 
subjects in each group. 


TABLE 11
Analysis of Type II Designs


Source               df                   Sums of Squares

Between-Subjects     an - 1               ss_S
  AB (b)             a - 1                ss_AB(b) = ss_G
  error (b)          a(n - 1)             ss_error(b) = ss_S - ss_G

Within-Subjects      an(a - 1)            ss_wS = ss_T - ss_S
  A                  a - 1                ss_A
  B                  a - 1                ss_B
  AB (w)             (a - 1)(a - 2)       ss_AB(w) = ss_AB - ss_G
  error (w)          a(a - 1)(n - 1)      ss_error(w) = ss_wS - ss_A - ss_B - ss_AB(w)

Total                a^2 n - 1            ss_T
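As a rough check on the partition in Table 11, the sketch below computes each sum of squares for a small Type II layout. The scores are made-up random numbers, the cyclic Latin square is an assumed choice, and the function name `type_ii_partition` is hypothetical; only the formulas come from the table.

```python
import random

def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

def type_ii_partition(scores, square):
    """Partition the total sum of squares for a Type II design.

    scores[g][s][j]: score of subject s in group g under A level j.
    square[g][j]:    B level given to group g under A level j (a Latin square).
    """
    a, n = len(scores), len(scores[0])
    obs = [(g, s, j, scores[g][s][j])
           for g in range(a) for s in range(n) for j in range(a)]
    grand = mean(x for *_, x in obs)

    def ss_between(key):
        """Sum of squares between the classes defined by key(g, s, j)."""
        classes = {}
        for g, s, j, x in obs:
            classes.setdefault(key(g, s, j), []).append(x)
        return sum(len(v) * (mean(v) - grand) ** 2 for v in classes.values())

    ss = {
        "T": sum((x - grand) ** 2 for *_, x in obs),
        "S": ss_between(lambda g, s, j: (g, s)),
        "G": ss_between(lambda g, s, j: g),
        "A": ss_between(lambda g, s, j: j),
        "B": ss_between(lambda g, s, j: square[g][j]),
        # Each A-B cell belongs to exactly one group, so (g, j) indexes the cells.
        "cells": ss_between(lambda g, s, j: (g, j)),
    }
    ss["wS"] = ss["T"] - ss["S"]
    ss["AB"] = ss["cells"] - ss["A"] - ss["B"]
    ss["AB(b)"] = ss["G"]
    ss["AB(w)"] = ss["AB"] - ss["G"]
    ss["error(b)"] = ss["S"] - ss["G"]
    ss["error(w)"] = ss["wS"] - ss["A"] - ss["B"] - ss["AB(w)"]
    df = {
        "AB(b)": a - 1, "error(b)": a * (n - 1),
        "A": a - 1, "B": a - 1, "AB(w)": (a - 1) * (a - 2),
        "error(w)": a * (a - 1) * (n - 1),
    }
    return ss, df

random.seed(0)  # made-up scores, for illustration only
a, n = 3, 4
square = [[(g + j) % a for j in range(a)] for g in range(a)]
scores = [[[random.gauss(50, 10) for _ in range(a)] for _ in range(n)]
          for _ in range(a)]
ss, df = type_ii_partition(scores, square)

ms_error_w = ss["error(w)"] / df["error(w)"]
print("F for A:", (ss["A"] / df["A"]) / ms_error_w)
print("F for B:", (ss["B"] / df["B"]) / ms_error_w)
```

The degrees of freedom returned by the function add to a^2 n - 1, and ss_error(b) + ss_error(w) equals the within-cells sum of squares, as the text requires.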



Tests of Significance of the Main Effects of A and B: Let us suppose that
constant corrections have been applied to the a measures for each subject so as
to make each subject mean equal to the general mean. This would of course
render ms_AB(b) and ms_error(b) both equal to zero, since both are “between”
effects. Let us suppose that subsequent corrections have been applied to the
measures in each row so as to make the row mean equal to the general mean,
thus rendering ms_B equal to zero. Let us suppose, finally, that within each
column of the table further corrections have been applied within each cell so as
to make each cell mean equal to the column mean. This would eliminate the
remainder of the AB interaction (its “within” component), or would render
ms_AB(w) equal to zero. The total sum of squares in the corrected table,
then, consists only of ss_A and ss'_error(w), so that ss'_wA = ss'_error(w), the
prime denoting a quantity computed from the corrected table. Let us now
assume that the AS interaction effects are normally and independently
distributed with the same variance for each cell in the table. On this assumption,
since there are no systematic differences among cells in the same column, the
design, so far as the corrected measures are concerned, is a simple-randomized
design, in which ms'_A/ms'_wA is distributed as F. However, ms'_A = ms_A and
ms'_wA = ms_error(w). Hence, in the original table, ms_A/ms_error(w) is distributed
as F.

The assumptions that were necessary in this proof are equivalent to the
assumptions that in a simple A X S experiment involving the same A treat- 
ments and the same subjects, but performed for any given level of B alone, the 
AS interaction effects would be normally and independently distributed with 
the same variance for each treatment, and that this variance would be the 
same for any level of B. The important considerations in judging the extent 
to which these assumptions are satisfied are essentially the same as those 
presented on pages 159-160 in Chapter 6. 

The proof that ms_B/ms_error(w) is distributed as F is exactly similar to the
preceding.

Test of the AB Interaction: To test the hypothesis that there is no interaction 
between A and B, each of the components of AB must be tested separately, 
each against its own appropriate error term. If both components are non-significant,
the hypothesis of no interaction may be retained. If either component
is significant, the hypothesis must be rejected. When this procedure
is followed, however, the risk of a Type I error is not that indicated by the 
level of significance at which the separate components are tested. Rather, it is 
almost twice that amount. If the hypothesis of no AB interaction is true, the 
within-component of AB will be "significant" at the 5% level five percent of 
the time. The between-component of AB will also be "significant" five 
percent of the time. But since the two components are entirely independent 
of one another, the times that one component is “significant” will rarely (5% 
of 5% of the time) coincide with the times when the other is “significant.” 
Under a true null-hypothesis, one or the other or both of the components will 
be “significant” at the 5% level almost ten percent of the time, so that if the 
null-hypothesis is rejected when either or both components is or are significant
at the 5% level, a true null-hypothesis will be rejected almost ten percent of 
the time. Accordingly, if in testing the hypothesis of no interaction in the 
Type II design, one wishes the risk of a Type I error to be less than X %, one 
should test each component at the X/2% level. (Strictly, the rule should read
“... at the 100[1 - sqrt(1 - X/100)]% level,” but when X is small the difference
becomes negligible, and the term reduces to approximately X/2.) The same rule applies in
the testing of a mixed interaction in any of the designs described subsequently 
in this chapter. 
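The arithmetic behind this rule can be checked directly. The sketch below assumes, as the text does, that the two components are independent; the function names are hypothetical.

```python
import math

def combined_risk(alpha):
    """Chance that at least one of two independent tests rejects a true null."""
    return 1 - (1 - alpha) ** 2

def per_component_level(overall):
    """Level for each of two independent tests that gives exactly `overall` risk."""
    return 1 - math.sqrt(1 - overall)

print(round(combined_risk(0.05), 4))         # 0.0975, "almost ten percent"
print(round(combined_risk(0.025), 6))        # 0.049375, just under 5%
print(round(per_component_level(0.05), 5))   # 0.02532, close to 0.05 / 2
```

Testing each component at the X/2% level therefore keeps the overall risk slightly below X%, which is why the simpler rule is adequate in practice.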

The “between” component of AB is tested by F = ms_AB(b)/ms_error(b). To
prove that this ratio is distributed as F, let us again suppose that corrections
have been applied to the data so as to eliminate all other mean squares. In an
analysis based on the subject means, the design would then reduce to a simple-
randomized design in which ms_G/ms_wG is distributed as F. In this case,
ms_G = ms_AB(b), and ms_wG = ms_error(b), the mean square for subjects within
groups. The underlying assumptions are that the distribution of subject means
is fundamentally normal for each group, that these distributions are homogeneous
in variance for all groups, and that each group was originally a random sample
from the same population.

The “within” component of AB is tested by F = ms_AB(w)/ms_error(w). To
prove that this ratio is distributed as F, we again suppose that corrections
have been applied to the data so as to eliminate all other mean squares. It
may then be shown that the corrected data may be regarded as coming from a
simple-randomized experiment, in which the between-treatments mean square
corresponds to ms_AB(w), and the within-treatments mean square corresponds to
ms_error(w). The underlying assumptions are the same as those involved in the
tests of the main effects of A and B.

It will be noted that when there are only two treatments in each classification,
that is, when a = 2, the degrees of freedom for ms_AB(w) becomes equal to
zero, which means that the corresponding sum of squares vanishes. This is
equivalent to saying what has been pointed out earlier (page 274), that when
a = 2, the AB interaction is completely confounded with G, or that the
interaction effect is entirely a “between” subjects effect.

Tests of Simple Effects: The simplest way to test the simple effect of A at a
given level of B is to use the data from the given level of B only. For a single
level of B, the design is a simple randomized design, and the appropriate error
term for testing the (simple) A effect is ms_within cells computed for the given level
of B only. Either an over-all F-test or specific t-tests may be employed at the
given level of B. The procedure for testing simple B effects, using the data
from only a single column (level of A), is exactly the same.

If one wishes to employ an error term based on all of the experimental data,
an approximate t-test for testing the significance of the difference between two
different A (cell) means at the same level of B may be applied by the method
described on page 272. Again, ss_w cells = ss_error(b) + ss_error(w). The denominator
of the t is sqrt(2 ms_w cells / n), and the critical value of t is determined as before. For
testing simple effects of B at a selected level of A, exactly the same procedure
is employed.
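A minimal sketch of the test statistic follows, assuming the denominator is the usual standard error of a difference between two independent cell means, each based on n subjects. The function name and the numbers are hypothetical, and the critical value (which the text takes from the method on page 272) is not computed here.

```python
import math

def approx_t(mean_1, mean_2, ss_error_b, ss_error_w, a, n):
    """t for two cell means at the same level of B, pooling all within-cell data."""
    # The pooled error has a(n - 1) + a(a - 1)(n - 1) = a^2 (n - 1) degrees of freedom.
    ms_w_cells = (ss_error_b + ss_error_w) / (a * a * (n - 1))
    return (mean_1 - mean_2) / math.sqrt(2 * ms_w_cells / n)

# Illustrative numbers only: a = 3 treatments, n = 5 subjects per group.
t = approx_t(mean_1=12.0, mean_2=10.0, ss_error_b=120.0, ss_error_w=240.0, a=3, n=5)
print(t)
```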

The Control of Criterion Differences: Among the more important applications 
of the Type II design are those in which it is desired to equalize the effects of 
the differences (usually small) in difficulty among the non-equivalent criterion 
tests (B) employed, such as lists, reading passages, and alternate forms of 
achievement tests. In such cases there is often no good a priori reason to 
suppose that there is any intrinsic AB interaction, and the experiment may 
sometimes have been administered so as to render unlikely any AB interaction 
due to varying Type G errors. Suppose, for example, that in the experiment 
concerned with effect of size of type upon reading rate described on page 277, 
all subjects (and all groups) were tested simultaneously in the same room. 
That is, all four groups would read their B1 passages (in a different size of type
for each group) simultaneously, all would then read their B2 passages simultaneously,
etc., “order” being confounded with “passages.” In that case, it
would be very difficult to conceive of specific differences in the administration 
of the 16 criterion test forms that would bias the results on any particular 
form with reference to any particular treatment. In such an experiment, 
therefore, if AB(w) did not differ significantly from error (w), one might safely 
assume that the observed AB interaction is due only to random Type S errors, 
and include it in the error terms. In that case, the sums of squares for the 
pooled error terms are 

ss_error(b) = ss_S, and ss_error(w) = ss_wS - ss_A - ss_B.


The number of degrees of freedom for the pooled error term df_error(b) is an - 1,
and for df_error(w) is an(a - 1) - (a - 1) - (a - 1) = (an - 2)(a - 1). The
ms_error(b) would be needed as the error term if one wanted, for example, to test
Mas, — Man; or if one wanted to establish a confidence interval for the mean 
for a given A treatment, independent of the other A treatments. 

In Type II designs employed to counterbalance the effect of the rank order 
in which the treatments are administered to the subjects, the interaction of 
treatments and order may likewise frequently prove non-significant. If there 
is no good a priori reason to suppose that there is an interaction, either intrin- 
sic or due to Type G errors, the sums of squares for AB(b) and AB(w) may, as 
suggested above, be pooled with the corresponding error terms. In general, 
however, such poolings should not be resorted to until tests of the separate 
components of AB have failed to reveal a significant interaction. 


Type III Designs


Suppose that a factorial experiment is to be performed with three factors, 
A, B, and C, with a possible total of abc treatment-combinations. In such situations,
one of the treatment classifications (A) may sometimes be such that
all treatments in that classification are administrable to the same subjects, but
this may not be true of the other (B and C) classifications. In that case, an 
experiment may be designed in which the main effect of A and all interactions 
involving A will be “within” effects, but the main effects of B and C and the 
BC interaction will be “between” effects. 

The subjects are divided at random into bc groups of the same size. Each 
group takes one of the B-C combinations in combination with each level of A; 
that is, each group takes a of the possible abc combinations. The design is
diagrammed below in two different ways for the case in which a = 4, b = 2,
and c = 3. The numbers on the faces of the parallelepiped identify the groups.
As seen in the diagram, Group 1 takes treatment-combinations A1B1C1, A2B1C1,
A3B1C1, and A4B1C1, etc.
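The assignment of treatment-combinations to groups can be enumerated as in the sketch below. The numbering of groups (C varying fastest within B) is an assumed convention, and `type_iii_groups` is a hypothetical name.

```python
from itertools import product

def type_iii_groups(a, b, c):
    """List the treatment-combinations (A, B, C) taken by each of the bc groups."""
    groups = {}
    bc_levels = product(range(1, b + 1), range(1, c + 1))
    for number, (j, k) in enumerate(bc_levels, start=1):
        # Every group keeps one B-C combination and takes all a levels of A.
        groups[number] = [(i, j, k) for i in range(1, a + 1)]
    return groups

groups = type_iii_groups(a=4, b=2, c=3)
for number, combos in groups.items():
    labels = ", ".join(f"A{i}B{j}C{k}" for i, j, k in combos)
    print(f"Group {number}: {labels}")
```

Each group receives a of the abc combinations, and the bc groups together cover all of them exactly once.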


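Type III

Within Effects: A, AB, AC, ABC
Between Effects: B, C, BC

[Figure, upper left: parallelepiped with dimensions A, B, and C; the numbers on
its faces identify the groups.]

Upper right: B-C combination taken by each group at every level of A (A1 to A4)

    Group 1   B1C1
    Group 2   B1C2
    Group 3   B1C3
    Group 4   B2C1
    Group 5   B2C2
    Group 6   B2C3

Lower left: Disregarding C (groups in each A-B cell)

            A1      A2      A3      A4
    B1    1,2,3   1,2,3   1,2,3   1,2,3
    B2    4,5,6   4,5,6   4,5,6   4,5,6

Lower middle: Disregarding B (groups in each A-C cell)

            A1      A2      A3      A4
    C1     1,4     1,4     1,4     1,4
    C2     2,5     2,5     2,5     2,5
    C3     3,6     3,6     3,6     3,6

Lower right: Disregarding A (group in each B-C cell)

            C1    C2    C3
    B1       1     2     3
    B2       4     5     6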


Analysis of Total Sum of Squares: Disregarding the A, B, and C classifica- 
tions, the total sum of squares may be analyzed into its between-subjects (S) 
and within-subjects (wS) components. 

In any three-way table, regardless of the meaning of the classifications 
corresponding to the three dimensions, the total sum of squares may always be 
analyzed into components in the manner of Chapter 10. In the Type III 
design, therefore, the total sum of squares may be analyzed into the A, B, C, 
AB, AC, BC, ABC, and within-cells components. It is evident from the 
preceding diagrams which of these components are “between” and which are 
“within” components. 

Disregarding the C classification, the design is clearly a Type I design (page 
268) in which A and AB are “within” components and B is a “between” component.
Disregarding the B classification, the design is again a Type I design,
in which A and AC are “within” components and C is a “between” com- 
ponent. Disregarding the A classification, the design is a simple factorial 
design of the type considered in Chapter 9, in which all components — B, C, 
and BC — are “between” components. When the design is viewed as repre- 
sented in the upper right-hand diagram preceding, it is apparent that the 
design is still of Type I. In this case, the row classification is the B-C 
cross classification, and ss_rows = ss_B + ss_C + ss_BC. From this it follows that
ss_rows X columns = ss_AB + ss_AC + ss_ABC. Since the rows X columns interaction
is clearly a “within” effect, it follows that ABC is a “within” effect also.

The analysis of the total sum of squares for this design is summarized in 
Table 12. This table indicates also how the various sums of squares may be 
computed.

Tests of Significance: The main effects of B and C and the BC interaction are 
tested against ms_error(b) as follows:


F = ms_B/ms_error(b),
F = ms_C/ms_error(b),
F = ms_BC/ms_error(b).


That these mean square ratios are distributed as F is evident from the fact that
if we disregard the A classification and base the analysis on subject means,
leaving only between-subjects differences, the design reduces (see lower right-
hand diagram) to a simple two-factor (B X C) design, in which ms_error(b) is the
within-cells mean square.

The error variance for the difference between the means of two individual B
categories is 2 ms_error(b)/acn. That for a difference between the means of two
individual B-C combinations is 2 ms_error(b)/an.

The main effect of A and all interactions involving A are tested against
ms_error(w). Suppose that we eliminate all effects involving B (B, BC, AB, and
ABC) through arithmetic corrections, and regard all of the corrected data as
projected into the AC face of the three-way table (see the middle-lower diagram
preceding). The design then reduces to a Type I design in which the
mean squares for A, C, and AC based on the corrected data are the same as
those in the original table. In this Type I design, the mean square for “within
A-C combinations” (ms'_error(w)) is the same as that for “within A-B-C
combinations” (ms_error(w)) in the original table, since all B differences have been
eliminated within each A-C combination. In this Type I design, then,
A is tested by F = ms'_A/ms'_error(w) = ms_A/ms_error(w) and AC is tested by
F = ms'_AC/ms'_error(w) = ms_AC/ms_error(w). By eliminating all effects involving
C (lower left-hand diagram) it may similarly be shown that AB may be
tested by F = ms_AB/ms_error(w). The error variance for the difference between
the means of two individual A categories is 2 ms_error(w)/bcn.

The procedures to follow in testing simple effects or differences among
individual treatment means can readily be inferred from the discussions of the
simple factorial design (Chapter 9) and the Type I and Type II designs of this
chapter. For instance, simple effects of B at a given level of C, or simple
effects of C at a given level of B, are tested against error (b). Simple effects
of A at a given level of B or C, or for a given BC combination, and the simple
AC or AB interactions for given levels of B and C respectively, are tested
against error (w). Simple effects of either B or C at a given level of A, or the
simple BC interaction at a given level of A, or simple effects of B for a given
AC combination, or simple effects of C for a given AB combination, may be
tested as in a simple factorial design (B X C) by using the data from the given
level of A only, the error term being the mean square for “within cells” computed
only for that level of A. For t-tests of differences in individual pairs of means
within a given level of A, one may use the data from the entire table by applying
the approximate t-test described on page 272.

The Type III design is particularly useful in experiments involving comparisons
of learning curves or of trends in training or extinction series for different
groups, since it permits the use of matched groups in such experiments. In this
case, the intervals in the training or extinction series correspond to the A
categories, the different training or extinction series correspond to the B
categories, and the levels of the control variables correspond to the C
categories. The interpretation of the data for applications of this type is
discussed in Chapter 15, pages 351 ff.


TABLE 12
Analysis of Type III Designs

Source               df                       Sums of Squares

Between-Subjects     bcn - 1                  ss_S
  B                  b - 1                    ss_B
  C                  c - 1                    ss_C
  BC                 (b - 1)(c - 1)           ss_BC = ss_cells(BC) - ss_B - ss_C
  error (b)          bc(n - 1)                ss_error(b) = ss_res(b)

Within-Subjects      bcn(a - 1)               ss_wS = ss_T - ss_S
  A                  a - 1                    ss_A
  AB                 (a - 1)(b - 1)           ss_AB = ss_cells(AB) - ss_A - ss_B
  AC                 (a - 1)(c - 1)           ss_AC = ss_cells(AC) - ss_A - ss_C
  ABC                (a - 1)(b - 1)(c - 1)    ss_ABC = ss_cells(ABC) - ss_cells(AB) - ss_C
                                                       - ss_AC - ss_BC
  error (w)          bc(a - 1)(n - 1)         ss_error(w) = ss_res(w)

Total                abcn - 1                 ss_T

Here ss_cells(AB) denotes the sum of squares between the cells of the A X B
table, and similarly for ss_cells(AC), ss_cells(BC), and ss_cells(ABC).
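The degrees of freedom in Table 12 can be checked for internal consistency with a short sketch such as the following. The function name is hypothetical, and n = 5 subjects per group is an arbitrary choice.

```python
def type_iii_df(a, b, c, n):
    """Degrees of freedom for each source in the Type III analysis."""
    between = {
        "B": b - 1,
        "C": c - 1,
        "BC": (b - 1) * (c - 1),
        "error (b)": b * c * (n - 1),
    }
    within = {
        "A": a - 1,
        "AB": (a - 1) * (b - 1),
        "AC": (a - 1) * (c - 1),
        "ABC": (a - 1) * (b - 1) * (c - 1),
        "error (w)": b * c * (a - 1) * (n - 1),
    }
    return between, within

a, b, c, n = 4, 2, 3, 5
between, within = type_iii_df(a, b, c, n)

# The components must add to the between-subjects, within-subjects, and total df.
assert sum(between.values()) == b * c * n - 1
assert sum(within.values()) == b * c * n * (a - 1)
assert sum(between.values()) + sum(within.values()) == a * b * c * n - 1

for source, value in {**between, **within}.items():
    print(f"{source:10s} {value}")
```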



Type IV Designs 


If the number of levels is the same for two of the factors in a three-factor 
experiment, it is possible to control individual differences in evaluating the
main effects of both of these factors, as well as the interaction of either of these 
factors with the third, without having to administer all treatments to each 
subject. That is, if a = b, it is possible to control individual differences in 
evaluating A, B, AC, and BC, leaving C as a between-effect and AB and ABC 
as mixed effects. The design consists essentially of c replications of the Type 
II design, one for each level of C. For the purposes of the experiment, the 
subjects are divided at random into ac groups of the same size (n). 

To diagram this design for any given level of C, one proceeds exactly as with 
the Type II design; that is, one selects an a X a Latin square, letting A be the 
column factor, G (groups) the row factor, and B the Latin square factor. The 
diagram for each of the other levels of C is exactly the same, except that a 
different set of groups corresponds to the rows for each C level. A design
of this kind is represented in the top pair of Latin squares below, for the case 
in which a = b = 3 and c = 2. The parallelepiped to the right in the top row 
constitutes an alternate way of representing exactly the same design. From 
either of these diagrams, it is apparent that group 1 takes A1B1C1, A2B2C1, and
A3B3C1, etc.
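A sketch of how such a design might be laid out is given below. It assumes a cyclic Latin square and a group numbering in which the levels of C alternate fastest (so that odd-numbered groups fall at C1 when c = 2); both are illustrative choices, and `type_iv_design` is a hypothetical name.

```python
def type_iv_design(a, c):
    """Treatment-combinations (A, B, C) taken by each of the ac groups."""
    design = {}
    for row in range(a):          # row of the a x a Latin square
        for k in range(c):        # level of C: one replication of the square each
            group = row * c + k + 1
            # B is the Latin square factor: a cyclic shift of the A index (assumed).
            design[group] = [(i + 1, (i + row) % a + 1, k + 1) for i in range(a)]
    return design

design = type_iv_design(a=3, c=2)
for group, combos in design.items():
    print(f"Group {group}: " + ", ".join(f"A{i}B{j}C{k}" for i, j, k in combos))

# Every one of the a * a * c treatment-combinations is taken by exactly one group.
all_combos = [combo for combos in design.values() for combo in combos]
assert len(set(all_combos)) == len(all_combos) == 3 * 3 * 2
```

With these choices group 1 takes A1B1C1, A2B2C1, and A3B3C1, and groups 1 and 2 share the same A-B combinations, differing only in the level of C.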


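Type IV: a = b

Within Effects: A, B, AC, BC
Between Effects: C
Mixed Effects: AB, ABC when a > 2. (When a = 2, these are “between” effects.)

[Figure, top row: a pair of 3 X 3 Latin squares, one for each level of C, with
columns A1, A2, A3, rows G (groups), and B as the Latin square factor; to the
right, a parallelepiped representing the same design.]

Disregarding C: the rows are the combined groups G1 + G2, G3 + G4, and
G5 + G6; the columns are A1, A2, A3; B is the Latin square factor.

Disregarding B (groups in each A-C cell)

            A1      A2      A3
    C1    1,3,5   1,3,5   1,3,5
    C2    2,4,6   2,4,6   2,4,6

Disregarding A (groups in each B-C cell)

            C1      C2
    B1    1,3,5   2,4,6
    B2    1,3,5   2,4,6
    B3    1,3,5   2,4,6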


Analysis of the Total Sum of Squares: The total sum of squares may be
analyzed as before into its S and wS components as well as into its A, B, C,
AB, AC, BC, ABC, and within-cells components. To classify the latter
components, we note first that this design is a combination of the Type I
design and the Type II design. Disregarding the C classification, the design 
is a Type II design, in which A and B are “within” effects and AB is partly a 
“within” effect and partly a “between” effect. Disregarding the B classifi- 
cation, the design is a Type I design, in which A and AC are “within” effects 
and C is a “between” effect. Disregarding the A classification, the design 
is again a Type I design, in which B and BC are "within" effects and C is a 
“between” effect. 

We have seen that in general the ABC interaction may be regarded as the 
interaction between the C factor and the AB interaction. To say that there 
is an ABC interaction, therefore, is to say that corresponding AB interaction 
effects are not the same for each level of C. In this design, the AB interaction
consists of two components, a “within” and a “between” component.
If the “within” interaction effects of AB are not the same for each level of C,
we would say that there is an interaction between C and the AB(w) interaction,
or that there is an ABC(w) interaction. The mean square for AB(b) for C1
only is the same as the mean square for between-groups for C1 only, that is,
the mean square for between groups 1, 3, and 5 in the diagram above. Accordingly,
when the data are regarded as projected into the AG face of the
AGC parallelepiped (with B as the Latin square factor), ms_ABC(b) is the mean
square for the rows X columns interaction, which is clearly a “between”
effect. The sum of squares for ABC(w) may be computed as a residual by
subtracting the sum of squares for ABC(b) from the sum of squares for ABC. 

The complete analysis of the total sum of squares for this design is summarized
in Table 13. In this table, ss_G(AB) is the sum of squares for “between-
combined groups,” each combination of groups containing all subjects who
have received the same treatments so far as factors A and B are concerned,
that is, all subjects who have been given the same combinations of A and B,
without regard to C. For example, groups 1 and 2 have each had treatments
A1B1, A2B2, A3B3, so groups 1 and 2 together constitute one of the
“combined groups.” If the two A X G diagrams (upper left) on page 285
are combined, so that Groups 1 and 2 are in the top row, Groups 3 and 4
in the second, and Groups 5 and 6 in the third, then ss_G(AB) is the between-
rows sum of squares. In Table 13, ss_G is the sum of squares for between the
ac groups into which the subjects were originally divided at random. In the
illustration, it is the sum of squares for “between (the six) groups.”


TABLE 13
Analysis of Type IV Designs

Source               df                        Sums of Squares

Between-Subjects     acn - 1                   ss_S
  C                  c - 1                     ss_C
  AB (b)             a - 1                     ss_AB(b) = ss_G(AB)
  ABC (b)            (a - 1)(c - 1)            ss_ABC(b) = ss_G - ss_G(AB) - ss_C
  error (b)          ac(n - 1)                 ss_error(b) = ss_res(b)

Within-Subjects      acn(a - 1)                ss_wS = ss_T - ss_S
  A                  a - 1                     ss_A
  B                  a - 1                     ss_B
  AB (w)             (a - 1)(a - 2)            ss_AB(w) = ss_AB - ss_AB(b)
  AC                 (a - 1)(c - 1)            ss_AC
  BC                 (a - 1)(c - 1)            ss_BC
  ABC (w)            (a - 1)(a - 2)(c - 1)     ss_ABC(w) = ss_ABC - ss_ABC(b)
  error (w)          ac(a - 1)(n - 1)          ss_error(w) = ss_res(w)

Total                a^2 cn - 1                ss_T


It will be noted in Table 13 that when a = b = 2, the degrees of freedom
for two of the entries reduce to zero, namely AB(w) and ABC(w). In this
case, ss_AB is identical with ss_AB(b). That is, the AB interaction is completely
confounded with subject differences (see pages 261 to 264).

Tests of Significance: As in the preceding designs, the significance of any 
“within” effect is tested against error (w), and of any “between” effect 
against error (b). Simple effects of A or C at a given level of B, or the simple 
AC interaction at a given level of B, may be tested as in a simple factorial 
(A X C) design, using the data from the given level of B only. Likewise, 
simple effects of B or C at a given level of A, and the simple BC interaction 
at a given level of A, may be tested as in a simple factorial (B x C) design 
using the data from the given level of A only. For comparisons of individual 
means (of cells, rows, or columns) within a given level of A or a given level 
of B, the approximate t-test on page 272 may be applied, using ms_w cells for
the entire table. The validity of any of these tests may be established in
ways similar to those employed with previous mixed designs. The student
will find it a valuable exercise to demonstrate the validity of each of these
tests, and to note specifically the assumptions underlying each.

Uses of the Type IV Design: In general, in this and other mixed designs, 
mixed interactions cannot be so satisfactorily evaluated as homogeneous inter- 
actions, since the evaluation of the between-subjects component of the mixed 
interaction is usually much less precise than that of the within-subjects com- 
ponent. However, it is expected that these designs will in general be used in 
situations in which there is relatively little interest in the mixed interactions, 
and in which these may often not have to be evaluated at all. The Type IV 
design, for instance, will prove most satisfactory in situations in which the 
principal interest is in A and B and their interactions with C, and in which C, 
AB, and ABC are of relatively minor interest. 

The Type IV design provides a means of counterbalancing either order 
effects or criterion effects (see page 277), in an experiment that would other- 
wise employ a Type I design (with order effects or criterion effects randomized 
for each subject independently). When either criterion effects or order effects
(represented by B) are being counterbalanced, and when none of the inter- 
actions [AB(b), AB(w), ABC(b), ABC(w), and BC] with these effects proves 
significant, the mean squares for these interactions may often safely be 
pooled in the error terms. In that case, the sums of squares for the pooled
error terms are

ss_error(b) = ss_S - ss_C, and ss_error(w) = ss_wS - ss_A - ss_B - ss_AC,

with c(an - 1) and (a - 1)(acn - c - 1) degrees of freedom, respectively.
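The pooled degrees of freedom quoted above follow from adding the degrees of freedom of the pooled interactions to those of the original error terms. A minimal sketch, with hypothetical function names and the degrees of freedom taken from Table 13:

```python
def type_iv_df(a, c, n):
    """Degrees of freedom for each source in the Type IV analysis (Table 13)."""
    between = {
        "C": c - 1,
        "AB(b)": a - 1,
        "ABC(b)": (a - 1) * (c - 1),
        "error(b)": a * c * (n - 1),
    }
    within = {
        "A": a - 1,
        "B": a - 1,
        "AB(w)": (a - 1) * (a - 2),
        "AC": (a - 1) * (c - 1),
        "BC": (a - 1) * (c - 1),
        "ABC(w)": (a - 1) * (a - 2) * (c - 1),
        "error(w)": a * c * (a - 1) * (n - 1),
    }
    return between, within

def pooled_error_df(a, c, n):
    """Pool AB(b), ABC(b) into error (b) and AB(w), ABC(w), BC into error (w)."""
    between, within = type_iv_df(a, c, n)
    pooled_b = between["error(b)"] + between["AB(b)"] + between["ABC(b)"]
    pooled_w = (within["error(w)"] + within["AB(w)"]
                + within["ABC(w)"] + within["BC"])
    return pooled_b, pooled_w

# The pooled totals agree with c(an - 1) and (a - 1)(acn - c - 1).
for a in range(2, 6):
    for c in range(2, 5):
        for n in range(2, 5):
            pooled_b, pooled_w = pooled_error_df(a, c, n)
            assert pooled_b == c * (a * n - 1)
            assert pooled_w == (a - 1) * (a * c * n - c - 1)

print(pooled_error_df(a=3, c=2, n=5))
```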

The Type IV design may also be used to counterbalance both order and 
criterion effects in what would otherwise be a treatments X subjects experi- 
ment, A representing treatments, B the criterion effect, and C rank order of 
administration of the A treatments. 


Type V Designs 


When the number of levels is the same for all three factors (a = b = c),
it is possible to control subject differences in the evaluation of all of the
main effects, leaving all of the interactions as mixed effects. The subjects
are divided into a^2 different groups at random, the subjects in each group
taking only a of the possible a^3 treatment-combinations, no subject taking
any treatment in any classification more than once. For purposes of constituting
the groups, the design may be considered as consisting of a blocks,
such as those diagrammed below for the case in which a = 3. Each block is a
Graeco-Latin square, but each block utilizes a different Graeco-Latin square.
Note that each block is the same so far as the A-C pattern is concerned, but
that the B pattern varies, so that each B treatment is combined only once
with each A-C combination.

To diagram the first block of this design, one first prepares an a X a square, 
and lets the A-categories correspond to the columns and the groups (G) 
to the rows. One then selects two orthogonal a X a Latin squares (see page 
265) and lets the numbers in one of these squares represent the B subscripts 
and those in the other square the C subscripts in the corresponding cells of the 
diagram. To diagram the next block, one simply transposes the B pattern 
of Block 1 so that the B's in the first column of Block 1 become those in the 
last column of Block 2, while the B's in columns 2, 3, etc., of Block 1 become 
those in columns 1, 2, etc., respectively in Block 2. Each subsequent block 
is derived from the preceding block in the same fashion. The process might 
be described as that of “rotating” the columns of the various blocks so far 
as the B factor is concerned. This has been done for the case in which a = 3
in the diagrams below, using the orthogonal Latin squares presented on 
page 265. 
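The rotation procedure can be sketched as follows. The pair of orthogonal squares used for Block 1, B = (column - row) mod a and C = (column + row) mod a, is an assumption that works only for odd a and need not be the pair printed on page 265; `type_v_blocks` is a hypothetical name.

```python
from itertools import product

def type_v_blocks(a):
    """Build the a blocks of a Type V design; each cell holds a (B, C) pair."""
    if a % 2 == 0:
        raise ValueError("this simple construction needs an odd a")
    blocks = []
    for shift in range(a):
        # Shifting the B pattern one column to the left per block moves the B's of
        # column 1 to the last column, as in the rotation described above.
        block = [[((col + shift - row) % a + 1, (col + row) % a + 1)
                  for col in range(a)] for row in range(a)]
        blocks.append(block)
    return blocks

blocks = type_v_blocks(3)
group = 0
for number, block in enumerate(blocks, start=1):
    print(f"Block {number} (columns A1, A2, A3)")
    for row in block:
        group += 1
        print(f"  G{group}: " + "  ".join(f"B{b}C{c}" for b, c in row))

# Every one of the a^3 treatment-combinations is used exactly once.
used = [(col + 1, b, c)
        for block in blocks for row in block for col, (b, c) in enumerate(row)]
assert sorted(used) == sorted(product(range(1, 4), repeat=3))
```

Because the A-C pattern is the same in every block while the B pattern rotates, each B treatment meets each A-C combination once, which is the property noted in the text.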


Type V
a = b = c

Within Effects: A, B, C
Mixed Effects: All interactions when a > 2. (When a = 2, all
               double interactions are “between” effects, and
               the triple interaction is a “within” effect.)


          Block 1                     Block 2                     Block 3
        A1     A2     A3            A1     A2     A3            A1     A2     A3

  G1   B1C1   B2C2   B3C3     G4   B2C1   B3C2   B1C3     G7   B3C1   B1C2   B2C3
  G2   B3C2   B1C3   B2C1     G5   B1C2   B2C3   B3C1     G8   B2C2   B3C3   B1C1
  G3   B2C3   B3C1   B1C2     G6   B3C3   B1C1   B2C2     G9   B1C3   B2C1   B3C2

This design may also be represented as in the upper left-hand diagram
on page 290, each of the layers parallel to the front face being separately 
shown to the right of the parallelepiped. The numbers in the various cells 
identify the groups, as in preceding diagrams. 



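[Figure, upper left: parallelepiped with dimensions A, B, and C.]

Layers parallel to the front face (group in each cell)

       C1 layer               C2 layer               C3 layer
        A1  A2  A3             A1  A2  A3             A1  A2  A3
    B1   1   6   8         B1   5   7   3         B1   9   2   4
    B2   4   9   2         B2   8   1   6         B2   3   5   7
    B3   7   3   5         B3   2   4   9         B3   6   8   1

Disregarding A (groups in each B-C cell)

            C1      C2      C3
    B1    1,6,8   3,5,7   2,4,9
    B2    2,4,9   1,6,8   3,5,7
    B3    3,5,7   2,4,9   1,6,8

Disregarding B (groups in each A-C cell)

            C1      C2      C3
    A1    1,4,7   2,5,8   3,6,9
    A2    3,6,9   1,4,7   2,5,8
    A3    2,5,8   3,6,9   1,4,7

Disregarding C (groups in each A-B cell)

            A1      A2      A3
    B1    1,5,9   2,6,7   3,4,8
    B2    3,4,8   1,5,9   2,6,7
    B3    2,6,7   3,4,8   1,5,9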


Analysis of Total Sum of Squares: When the A classification is disregarded,
the design is clearly a Type II design, in which B and C are “within” effects
and BC is a “mixed” effect.

When the B classification is disregarded, the design is again a Type II
design, in which A and C are “within” effects and AC is a “mixed” effect.

From the lower right-hand diagram, similarly, it is evident that A and B
are “within” effects and that AB is a “mixed” effect.

Now we know that 


ss_ABC = ss_cells - ss_A - ss_B - ss_C - ss_AB - ss_AC - ss_BC.     (102)


We know also that the sum of squares for cells consists in part of a “between”
and in part of a “within” component, as follows:


ss_cells = ss_G + ss_cwG                                            (103)


in which ss_G, the “between” component, is the sum of squares for between
the a^2 groups, with (a^2 - 1) degrees of freedom, and ss_cwG is the sum of
squares for between-cells-within-groups, with (a - 1) degrees of freedom for
each of the a^2 groups or a total of a^2(a - 1) degrees of freedom. We know
also that


ss_AB = ss_AB(b) + ss_AB(w)                                         (104)
ss_AC = ss_AC(b) + ss_AC(w)                                         (105)
ss_BC = ss_BC(b) + ss_BC(w).                                        (106)



Substituting from (103), (104), (105), and (106), in (102), and rearranging
terms, we may write


ss_ABC = [ss_G - ss_AB(b) - ss_AC(b) - ss_BC(b)]
       + [ss_cwG - ss_A - ss_B - ss_C - ss_AB(w) - ss_AC(w) - ss_BC(w)]     (107)


which makes it clear that ss_ABC is partly a “between” and partly a “within”
effect. If we let ss_ABC(b) equal the first of the expressions in brackets in
(107) and let ss_ABC(w) represent the second term in brackets, we may summarize
the entire analysis as in Table 14. In this table it will be noted that
the degrees of freedom for certain components reduce to zero in the case in
which a = 2, which means that in a 2 X 2 X 2 design there is no within-
subjects component of the two-factor interactions and no between-subjects
component of the three-factor interaction.


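TABLE 14
Analysis in Type V Designs

| Source | df | Sums of Squares |
|---|---|---|
| Between Subjects | $a^2n - 1$ | $ss_S$ |
| AB (b) | $a - 1$ | $ss_{AB(b)} = ss_{G(AB)}$ |
| AC (b) | $a - 1$ | $ss_{AC(b)} = ss_{G(AC)}$ |
| BC (b) | $a - 1$ | $ss_{BC(b)} = ss_{G(BC)}$ |
| ABC (b) | $(a - 1)(a - 2)$ | $ss_{ABC(b)} = ss_G - ss_{G(AB)} - ss_{G(AC)} - ss_{G(BC)}$ |
| error (b) | $a^2(n - 1)$ | $ss_{error(b)} = ss_{res(b)}$ |
| Within Subjects | $a^2n(a - 1)$ | $ss_{wS} = ss_T - ss_S$ |
| A | $a - 1$ | $ss_A$ |
| B | $a - 1$ | $ss_B$ |
| C | $a - 1$ | $ss_C$ |
| AB (w) | $(a - 1)(a - 2)$ | $ss_{AB(w)} = ss_{AB} - ss_{AB(b)}$ |
| AC (w) | $(a - 1)(a - 2)$ | $ss_{AC(w)} = ss_{AC} - ss_{AC(b)}$ |
| BC (w) | $(a - 1)(a - 2)$ | $ss_{BC(w)} = ss_{BC} - ss_{BC(b)}$ |
| ABC (w) | $(a - 1)(a^2 - 3a + 3)$ | $ss_{ABC(w)} = ss_{ABC} - ss_{ABC(b)}$ |
| error (w) | $a^2(a - 1)(n - 1)$ | $ss_{error(w)} = ss_{res(w)}$ |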
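
The degrees of freedom in Table 14 can be checked mechanically. The following is a minimal Python sketch, not part of the original text; the function name `type_v_degrees_of_freedom` is hypothetical. It tabulates the entries for given a and n and confirms that they add up to the between-subjects and within-subjects totals.

```python
def type_v_degrees_of_freedom(a, n):
    """Degrees of freedom for a Type V design with a levels of each factor
    and n subjects in each of the a**2 groups (entries as in Table 14)."""
    between = {
        "AB(b)": a - 1,
        "AC(b)": a - 1,
        "BC(b)": a - 1,
        "ABC(b)": (a - 1) * (a - 2),
        "error(b)": a**2 * (n - 1),
    }
    within = {
        "A": a - 1,
        "B": a - 1,
        "C": a - 1,
        "AB(w)": (a - 1) * (a - 2),
        "AC(w)": (a - 1) * (a - 2),
        "BC(w)": (a - 1) * (a - 2),
        "ABC(w)": (a - 1) * (a**2 - 3 * a + 3),
        "error(w)": a**2 * (a - 1) * (n - 1),
    }
    # The components must exhaust the between- and within-subjects totals.
    assert sum(between.values()) == a**2 * n - 1
    assert sum(within.values()) == a**2 * n * (a - 1)
    return between, within


between, within = type_v_degrees_of_freedom(a=3, n=5)
between2, within2 = type_v_degrees_of_freedom(a=2, n=5)
```

With a = 2 the entries for AB(w), AC(w), BC(w), and ABC(b) come out as zero, which is the special case noted in the text.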


Tests of Significance: As in the preceding designs, the significance of any
“within” effect is tested against error (w), and of any “between” effect
against error (b). Simple effects of either A or B at a given level of C, or
the simple AB interaction for a given level of C, may be tested as in a simple
factorial design by using the data from the given level of C only. The same
applies to simple effects of A or C at a given level of B, or of B or C at a given
level of A, since for any single level of any factor the design is a simple
factorial design. For comparisons of individual means (of cells, rows or
columns) within any single level of any given factor, the approximate t-test
of the type given on page 272 may be applied, using $ms_{w\,cells}$ from the entire
table. It is left as a possible exercise for the student to demonstrate the
validity of each of these tests in ways similar to those employed with previous
mixed designs.

Uses of Type V Designs: The Type V design provides another means (in 
addition to Type IV) of counterbalancing both the effects of order and of 
criterion differences when only one experimental factor is involved. In 
many such experiments, all interactions with order and with criterion differ- 
ences may prove nonsignificant, and the assumptions may otherwise be 
plausible that no such interactions exist. In that case, these interactions 
may be pooled with the error terms, and the sums of squares for the pooled 
error terms are 

$$ss_{error(b)} = ss_S, \quad df = (a^2n - 1),$$
and
$$ss_{error(w)} = ss_{wS} - ss_A - ss_B - ss_C, \quad df = (a - 1)(a^2n - 3).$$


Type VI Designs 


If it is possible to administer all combinations of two of the factors to the 
same subjects, an experiment with three factors may be designed so that 
subject differences are controlled in all effects except the main effect of one 
factor (C). For the purposes of the experiment, the subjects are divided at 
random into c groups, each of which takes one level of C in combination 
with all possible combinations of A and B. This design is diagrammed 
below for a = 2, b = 3, and c = 4. 


[Figure: Type VI design for a = 2, b = 3, and c = 4. Within effects: A, B, AB,
AC, BC, ABC. Between effects: C. In the layout “Disregarding C,” every A-B
cell is taken by all four groups (1,2,3,4).]




Analysis of Total Sum of Squares: This design represents a combination of 
the A X B X S design (page 237) and the Type I design (page 267). 

When the C classification is disregarded, the design reduces to an A X B X S
design, in which A, B, and AB are all “within” effects.

When the B classification is disregarded (regarding all data as projected 
into the top face of the parallelepiped), the design reduces to a Type I design, 
in which A and AC are “within” effects and C is a “between” effect. 

When the A classification is disregarded (regarding all data as projected 
into the right face of the parallelepiped), the design again reduces to a Type I 
design, in which B and BC are “within” effects and C is a “between” effect. 

In Chapter 9 we noted that in the A X B X S design, the appropriate
error term for A is AS, for B is BS, and for AB is ABS. That is, we noted
that A, B, and AB each required its own error term (except on the assumption
that the AS, BS, and ABS interactions are the same). In the Type VI design,
which is in part an A X B X S design, it is again necessary to provide separate
error terms for A, B, and AB. For much the same reason, it is necessary to
provide separate error terms for AC, BC, and ABC as well.

“Pseudo Replications”: To show how the total sum of squares may be
analyzed into the components needed for these error terms, we shall resort
to a device which we will call that of “pseudo replications.” Suppose that
after the experiment has been conducted, one subject is selected at random
from each of the C groups, and that the c subjects thus selected are regarded
as constituting a single “replication” of the experiment. A second “replication”
is formed by selecting one subject at random from the remaining subjects
in each group, and further replications are similarly constituted until
n replications have been formed. The diagram for a single replication would
look just like the preceding diagrams except that the groups (G) would be
replaced by individual subjects (S). Thus, the individual replications could
be diagrammed as follows, the numbers on the faces of the parallelepipeds
now identifying individual subjects rather than groups of subjects.

[Figure: Replication 1, Replication 2, Replication 3, ... 4, 5, etc.]

Since these replications are not actual replications, in the sense of
replications identified in the planning of the experiment and separately
administered during the experiment, we have called them “pseudo” replications.
They are, nevertheless, strictly random replications, and they are “simple”
replications as well. Hence, under the conditions specified on page 191, the
significance of any effect not involving R can be tested against its interaction
with R, granting that the interaction is not a mixed effect.
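
The selection procedure just described can be sketched in Python. This is only an illustration under stated assumptions: the function name `form_pseudo_replications`, the subject labels, and the seed are hypothetical, and all groups are assumed to be of the same size n.

```python
import random


def form_pseudo_replications(groups, seed=None):
    """Return a list of n replications, each holding one subject per group.

    `groups` maps a group label (a level of C) to its list of subject IDs.
    All groups are assumed to contain the same number of subjects, n.
    """
    rng = random.Random(seed)
    sizes = {len(subjects) for subjects in groups.values()}
    if len(sizes) != 1:
        raise ValueError("all groups must contain the same number of subjects")

    # Shuffling each group independently is equivalent to repeatedly drawing
    # one subject at random from the subjects remaining in that group.
    shuffled = {}
    for label, subjects in groups.items():
        order = list(subjects)
        rng.shuffle(order)
        shuffled[label] = order

    n = sizes.pop()
    return [{label: shuffled[label][r] for label in groups} for r in range(n)]


# Hypothetical example: c = 4 groups of n = 3 subjects each.
groups = {
    "C1": ["s01", "s02", "s03"],
    "C2": ["s04", "s05", "s06"],
    "C3": ["s07", "s08", "s09"],
    "C4": ["s10", "s11", "s12"],
}
replications = form_pseudo_replications(groups, seed=1953)
```

Each element of `replications` contains one subject from every level of C, and every subject appears in exactly one replication.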

We have thus constituted the original three-dimensional design into a 
four-dimensional design, and the analysis is now into the following com- 
ponents: 


A     AB     ABC     ABCR
B     AC     ABR
C     BC     ACR
R     AR     BCR
      BR
      CR


Now it can be shown¹ that in the population, AR = ACR, BR = BCR, and
ABR = ABCR, and that the R effect is equal to CR. This being the case,
the corresponding mean squares obtained in the experiment can differ only
by chance, and hence can be pooled into more stable error terms. The
error sums of squares may be denoted as:

$$ss_{error_1(w)} = ss_{AR} + ss_{ACR}$$
$$ss_{error_2(w)} = ss_{BR} + ss_{BCR}$$
$$ss_{error_3(w)} = ss_{ABR} + ss_{ABCR}$$

¹ To save space, a complete proof of this will be omitted here, but the nature of the
proof will be suggested. Consider these effects for only the first two levels of each
factor. For any given replication, the A effect then depends upon the comparison

$$A_1 - A_2 = (A_1B_1C_1 + A_1B_2C_1 + A_1B_1C_2 + A_1B_2C_2) - (A_2B_1C_1 + A_2B_2C_1 + A_2B_1C_2 + A_2B_2C_2)$$
$$= (A_1B_1C_1 + A_1B_2C_1 - A_2B_1C_1 - A_2B_2C_1) + (A_1B_1C_2 + A_1B_2C_2 - A_2B_1C_2 - A_2B_2C_2)$$
$$= u + v,$$

u and v representing the last two terms in parentheses. Now the AR interaction in the
population is measured by the variance of the A effects for all possible replications, that
is, the AR interaction is measured by $\sigma^2_{u+v}$. But in each replication, u and v are
based on different randomly selected subjects, so that u and v are independent of one
another, or uncorrelated. Accordingly, $\sigma^2_{u+v} = \sigma^2_u + \sigma^2_v$, and hence AR is measured
by $\sigma^2_u + \sigma^2_v$.

For any given replication, the AC interaction effect depends upon the comparison

$$(A_1C_1 - A_2C_1) - (A_1C_2 - A_2C_2) = [(A_1B_1C_1 + A_1B_2C_1) - (A_2B_1C_1 + A_2B_2C_1)] - [(A_1B_1C_2 + A_1B_2C_2) - (A_2B_1C_2 + A_2B_2C_2)]$$
$$= (A_1B_1C_1 + A_1B_2C_1 - A_2B_1C_1 - A_2B_2C_1) - (A_1B_1C_2 + A_1B_2C_2 - A_2B_1C_2 - A_2B_2C_2)$$
$$= u - v.$$

The ACR interaction depends upon the variance of the AC effects for all possible
replications, that is, in the population, ACR depends upon $\sigma^2_{u-v} = \sigma^2_u + \sigma^2_v$.
Accordingly, AR = ACR. If AR = ACR for the first two levels of each factor, and if each of
these interactions is homogeneous for all levels, then these interactions must be the
same for the table as a whole. By a similar reasoning, it may be shown that BR =
BCR, that ABR = ABCR, and that R = CR.


It is not necessary actually to constitute the data into “pseudo-random 
replications” in order to compute these error terms. The device of “pseudo 
replications" was used primarily as a pedagogical device, to enable the 
student to understand more readily why these terms are valid as error terms 
for the indicated effects. It may be shown that the sums of squares for the 
various error terms may be computed as follows: 


$$ss_{error_1(w)} = ss_{\overline{ACS}} - ss_S - ss_A - ss_{AC}$$
$$ss_{error_2(w)} = ss_{\overline{BCS}} - ss_S - ss_B - ss_{BC}$$
$$ss_{error_3(w)} = ss_{wS} - ss_{\overline{ABC}} + ss_C - ss_{error_1(w)} - ss_{error_2(w)}$$

In these expressions $ss_{\overline{ACS}}$, $ss_{\overline{BCS}}$, and $ss_{\overline{ABC}}$ have the same significance
as in previous designs. That is, $ss_{\overline{ACS}}$ is the sum of squares for “between
A-C-S combinations.” Each subject takes all A treatments in combination
with only one C treatment, hence there are a such combinations for each
subject. We will let $X_{A_1B_1C_1S_1}$ represent the criterion measure for Subject 1
in Group 1 for the $A_1B_1C_1$ treatment combination, etc. The total for a single
A-C-S combination for Subject 1 in Group 1 would then be

$$T_{A_1C_1S_1} = X_{A_1B_1C_1S_1} + X_{A_1B_2C_1S_1} + \cdots + X_{A_1B_bC_1S_1}.$$

For this subject we may then compute

$$\frac{T^2_{A_1C_1S_1} + T^2_{A_2C_1S_1} + \cdots + T^2_{A_aC_1S_1}}{b}.$$

If we add these sums for all subjects in all groups, and subtract $T^2/N$, we
have $ss_{\overline{ACS}}$.

A similar procedure may be followed to compute $ss_{\overline{BCS}}$. The sums of
squares for error₁ (w) and error₂ (w) may then be readily computed, and the
sum of squares for error₃ (w) may be obtained as a residual.
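
The computation described above can be sketched in plain Python. This is an illustrative sketch rather than the author's procedure: the function name `type_vi_error_terms`, the nested-list layout `X[g][s][i][j]`, and the example scores are all assumptions made for the illustration.

```python
from itertools import product


def type_vi_error_terms(X):
    """Error sums of squares for a Type VI design.

    X[g][s][i][j] is the score of subject s in group g (the group taking
    level g of C) under treatment combination A_i B_j.
    """
    c, n, a, b = len(X), len(X[0]), len(X[0][0]), len(X[0][0][0])
    N = a * b * c * n
    cells = list(product(range(c), range(n), range(a), range(b)))
    correction = sum(X[g][s][i][j] for g, s, i, j in cells) ** 2 / N  # T^2 / N

    def between(key):
        # Sum of squares between the categories named by key(g, s, i, j).
        totals = {}
        for g, s, i, j in cells:
            k = key(g, s, i, j)
            totals[k] = totals.get(k, 0) + X[g][s][i][j]
        per_total = N / len(totals)  # number of scores in each total
        return sum(t * t for t in totals.values()) / per_total - correction

    ss_T = between(lambda g, s, i, j: (g, s, i, j))
    ss_S = between(lambda g, s, i, j: (g, s))   # subjects are nested in groups
    ss_C = between(lambda g, s, i, j: g)
    ss_A = between(lambda g, s, i, j: i)
    ss_B = between(lambda g, s, i, j: j)
    ss_AC = between(lambda g, s, i, j: (g, i)) - ss_A - ss_C
    ss_BC = between(lambda g, s, i, j: (g, j)) - ss_B - ss_C
    ss_ABC_cells = between(lambda g, s, i, j: (g, i, j))  # between A-B-C combinations
    ss_ACS = between(lambda g, s, i, j: (g, s, i))        # between A-C-S combinations
    ss_BCS = between(lambda g, s, i, j: (g, s, j))        # between B-C-S combinations

    ss_wS = ss_T - ss_S
    error1 = ss_ACS - ss_S - ss_A - ss_AC
    error2 = ss_BCS - ss_S - ss_B - ss_BC
    error3 = ss_wS - ss_ABC_cells + ss_C - error1 - error2  # residual
    return {"error(b)": ss_S - ss_C, "error1(w)": error1,
            "error2(w)": error2, "error3(w)": error3}


# Hypothetical scores with c = 2, n = 3, a = 2, b = 3. The s * i term makes
# the A effect differ from subject to subject; B behaves alike for everyone.
X = [[[[10 * g + s + 2 * i + 3 * j + i * j + s * i for j in range(3)]
       for i in range(2)] for s in range(3)] for g in range(2)]
errors = type_vi_error_terms(X)
```

In this constructed example only the A effect varies among subjects, so error₁ (w) is positive while error₂ (w) and error₃ (w) vanish apart from rounding.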

The various components of the total sum of squares are classed as 
“between” and “within” effects in Table 15 on the following page, in which 
the degrees of freedom and the computational procedures are also indicated. 




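TABLE 15
Analysis in Type VI Designs

| Source | df | Sums of Squares |
|---|---|---|
| Between-Subjects | $cn - 1$ | $ss_S$ |
| C | $c - 1$ | $ss_C = ss_G$ |
| error (b) | $c(n - 1)$ | $ss_{error(b)} = ss_S - ss_C$ |
| Within-Subjects | $cn(ab - 1)$ | $ss_{wS} = ss_T - ss_S$ |
| A | $a - 1$ | $ss_A$ |
| B | $b - 1$ | $ss_B$ |
| AB | $(a - 1)(b - 1)$ | $ss_{AB}$ |
| AC | $(a - 1)(c - 1)$ | $ss_{AC}$ |
| BC | $(b - 1)(c - 1)$ | $ss_{BC}$ |
| ABC | $(a - 1)(b - 1)(c - 1)$ | $ss_{ABC}$ |
| error (w) | $c(ab - 1)(n - 1)$ | $ss_{error(w)} = ss_{wS} - ss_A - ss_B - ss_{AB} - ss_{AC} - ss_{BC} - ss_{ABC} = ss_{wS} - ss_{\overline{ABC}} + ss_C$ |
| (or) | | |
| error₁ (w) | $(a - 1)(n - 1)c$ | $ss_{error_1(w)} = ss_{\overline{ACS}} - ss_S - ss_A - ss_{AC}$ |
| error₂ (w) | $(b - 1)(n - 1)c$ | $ss_{error_2(w)} = ss_{\overline{BCS}} - ss_S - ss_B - ss_{BC}$ |
| error₃ (w) | $(a - 1)(b - 1)(n - 1)c$ | $ss_{error_3(w)} = ss_{error(w)} - ss_{error_1(w)} - ss_{error_2(w)}$ |
| Total | $abcn - 1$ | $ss_T$ |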


Tests of Significance: In general, as in any random replications design,
any homogeneous effect may be tested against its own interaction with R
(on the assumption that the interaction effects with R are normally and
independently distributed with the same variance for each treatment or
treatment-combination). However, since more stable error terms may be
obtained by pooling the interactions with R in the manner indicated, the
various tests of significance are made as follows: A and AC against error₁ (w),
B and BC against error₂ (w), AB and ABC against error₃ (w), and C against
error (b).

If one is willing to assume on a priori grounds that for each level of C
the AS, BS, and ABS interactions are the same except for chance, in which
case it would follow¹ that except for chance $ms_{AR} = ms_{BR} = ms_{ABR}$, one can
employ a single pooled error term for all “within” effects. In general,
however, it is better not to pool the error (w) terms until it has first been
demonstrated by means of the Bartlett test that there are no statistically
significant differences among them. The only safe procedure in general is the
relatively laborious one of analyzing the total sum of squares into all the
terms shown in Table 15.

¹ If, in each replication, $ms_{AS} = ms_{BS} = ms_{ABS}$, then, for all replications it must follow
that $ms_{AR} = ms_{BR} = ms_{ABR}$, since each replication is an independent random sample of
subjects.


Type VII Designs 


If it is possible, but not desirable, in a three-factor experiment to administer
all treatment combinations to each subject, and if a = b, it is possible
to control subject differences in the evaluation of all effects except the AB
interaction effect. The subjects are divided at random into a groups of the
same size, each of which takes ac of the possible $a^2c$ treatment-combinations.
The design is illustrated below for a = b = 3 and c = 2. In this case, the
subjects are divided into three groups. The design for any single level of C
is a Type II design, but the same groups are involved at each level of C (as
contrasted with the Type IV design in which different sets of groups were
employed for each level of C). The upper left-hand figure below represents
the design for level 1 of C alone. The design for level 2 of C is exactly the
same. The parallelepiped to the right constitutes an alternative way of
representing exactly the same design. It will be seen from the diagram that
Group 1 takes $A_1B_1C_1$, $A_1B_1C_2$, $A_2B_2C_1$, $A_2B_2C_2$, $A_3B_3C_1$, and $A_3B_3C_2$, etc.
The numbers on the faces of the parallelepiped identify the groups taking the
corresponding treatment-combinations.
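
One way to generate such an assignment is sketched below in Python. The text fixes only the combinations taken by Group 1; the cyclic Latin square used here for the remaining groups, and the function name `type_vii_plan`, are assumptions made for the illustration.

```python
def type_vii_plan(a, c):
    """Map each group to the (A, B, C) treatment combinations it takes.

    Levels are numbered from 1. Group membership follows a cyclic Latin
    square on A and B (an assumed layout), repeated at every level of C.
    """
    plan = {g: [] for g in range(1, a + 1)}
    for i in range(1, a + 1):          # level of A
        for j in range(1, a + 1):      # level of B (a = b in this design)
            group = (j - i) % a + 1    # the diagonal cells go to Group 1
            for k in range(1, c + 1):  # same groups at every level of C
                plan[group].append((i, j, k))
    return plan


plan = type_vii_plan(a=3, c=2)
```

Under this layout each group takes ac = 6 of the 18 combinations, meets every level of A, B, and C, and no combination is taken by more than one group.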


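[Figure: Type VII design (a = b). Mixed effects: AB. Within effects: all
except AB. The figure shows the design for level 1 of C alone, the equivalent
parallelepiped, and the layouts “Disregarding A” and “Disregarding B.”]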




Analysis of Total Sums of Squares: It is evident from this diagram that
the design is a combination of the A X B X S design on page 237 and the
Type II design. Disregarding the C classification (viewing the front face of
the parallelepiped only), it is clear that the design reduces to a Type II
design, in which the main effects of A and B are “within” effects and AB is “mixed.”

Disregarding the A classification, it is evident that all subjects take each
of the B-C combinations, or that the design reduces to a B X C X S design,
in which all treatment effects (including interactions) are “within” effects.
Similarly, disregarding the B classification, it is evident that the design
reduces to an A X C X S design, in which all treatment effects are “within” effects.

The “between” component of AB is the same as the between-groups
component, that is, $ms_{AB(b)} = ms_G$. The interaction of AB(b) with C, that is, the
interaction of G with C, is clearly a “within” effect. The interaction of
AB(w) and C is also a “within” effect, and hence, ABC is a “within” effect.
Thus all of the effects are “within” effects except for a part of the AB effect.

Since so far as B and C alone or A and C alone are concerned, the Type
VII design is a treatments X treatments X subjects design (B X C X S or
A X C X S), it is evident that several error terms may be required for A, B,
C, AB(w), BC, AC, and ABC. It is possible, but not necessary, to compute
these by again constituting the subjects into pseudo random replications,
with a subjects in each replication. The diagram for the first replication
would then be the same as the parallelepiped on the preceding page,
except that the numbers on the faces of the parallelepiped would identify
individual subjects rather than groups. The diagram for the second replication
would be similar, except that 4, 5, and 6 would replace 1, 2, and 3, respectively,
etc. An interaction with R could then be computed for each of the treatment
effects and treatment interactions. Now it may be shown¹ that in the
population the mean squares for AR, BR, and AB(w)R are the same, the
term AB(w)R representing the interaction with R of the “within” component
of AB. It may also be shown that the population mean squares of ACR,
BCR, and AB(w)CR are the same, as are those of CR and AB(b)CR.

Accordingly, the sums of squares for each of these sets of interactions
with R may be pooled to form more stable error terms, as follows:

$$ss_{error_1(w)} = ss_{AR} + ss_{BR} + ss_{AB(w)R}$$
$$ss_{error_2(w)} = ss_{CR} + ss_{AB(b)CR}$$
$$ss_{error_3(w)} = ss_{ACR} + ss_{BCR} + ss_{AB(w)CR}$$

¹ Again, complete proof of these equalities will be omitted here, but the nature of the
proofs will be suggested.

Consider the design diagrammed on page 297. A single replication in this design is
diagrammed in a different fashion below, $x_1$ representing the criterion measure for
$A_1B_1C_1$, $x_2$ that for $A_2B_2C_1$, etc. There are three subjects involved in each replication,
the x's representing the criterion measures for one subject, the y's those for another,
and the z's those for the other.

|  | $C_1$: $A_1$ | $A_2$ | $A_3$ | $C_2$: $A_1$ | $A_2$ | $A_3$ |
|---|---|---|---|---|---|---|
| $B_1$ | $x_1$ | $y_1$ | $z_1$ | $x_4$ | $y_4$ | $z_4$ |
| $B_2$ | $z_2$ | $x_2$ | $y_2$ | $z_5$ | $x_5$ | $y_5$ |
| $B_3$ | $y_3$ | $z_3$ | $x_3$ | $y_6$ | $z_6$ | $x_6$ |

So far as $A_1$ and $A_2$ alone are concerned, the A effect for a single replication is
measured by

$$A_1 - A_2 = (x_1 + z_2 + y_3 + x_4 + z_5 + y_6) - (y_1 + x_2 + z_3 + y_4 + x_5 + z_6)$$
$$= (x_1 - x_2 + x_4 - x_5) - (y_1 - y_3 + y_4 - y_6) + (z_2 - z_3 + z_5 - z_6).$$

So far as $A_1$ and $A_2$ are concerned, the AR interaction in the population is measured
by the variance of the A effects for all possible replications in the population. That is,

$$AR = k\sigma^2_{(x_1 - x_2 + x_4 - x_5) - (y_1 - y_3 + y_4 - y_6) + (z_2 - z_3 + z_5 - z_6)}$$
$$= k\sigma^2_{(x_1 - x_2 + x_4 - x_5)} + k\sigma^2_{(y_1 - y_3 + y_4 - y_6)} + k\sigma^2_{(z_2 - z_3 + z_5 - z_6)} \tag{a}$$
$$= k(\sigma_1^2 + \sigma_2^2 + \sigma_3^2)$$

in which $\sigma_1^2 = \sigma^2_{(x_1 - x_2 + x_4 - x_5)}$, etc., and k is a constant, dependent upon a and c, which
we need not identify now.

In a similar fashion, it may be shown that so far as $A_1$ and $A_3$ alone are concerned

$$AR = k\sigma^2_{(x_1 - x_3 + x_4 - x_6)} + k\sigma^2_{(y_2 - y_3 + y_5 - y_6)} + k\sigma^2_{(z_1 - z_2 + z_4 - z_5)} \tag{b}$$
$$= k(\sigma_4^2 + \sigma_5^2 + \sigma_6^2)$$

and that, so far as $A_2$ and $A_3$ alone are concerned,

$$AR = k\sigma^2_{(x_2 - x_3 + x_5 - x_6)} + k\sigma^2_{(y_1 - y_2 + y_4 - y_5)} + k\sigma^2_{(z_1 - z_3 + z_4 - z_6)} \tag{c}$$
$$= k(\sigma_7^2 + \sigma_8^2 + \sigma_9^2).$$

If AR in the population is homogeneous, so that the values of AR in (a), (b), and (c)
are the same, then it follows that

$$AR = \frac{k}{3}[\sigma_1^2 + \sigma_2^2 + \sigma_3^2 + \sigma_4^2 + \sigma_5^2 + \sigma_6^2 + \sigma_7^2 + \sigma_8^2 + \sigma_9^2].$$

In an exactly similar fashion it may be shown that BR for all levels of A is given by

$$BR = \frac{k}{3}[\sigma_1^2 + \sigma_2^2 + \sigma_3^2 + \sigma_4^2 + \sigma_5^2 + \sigma_6^2 + \sigma_7^2 + \sigma_8^2 + \sigma_9^2]$$

and that hence, AR = BR.

The nature of the proofs that AR (= BR) = [AB(w)]R, that ACR = BCR =
[AB(w)]CR, and that CR = [AB(b)]CR is similar to the preceding.




As with the Type VI design, it is not necessary actually to constitute the
data into pseudo random replications to compute these error terms. Rather,
they may be computed in the manner indicated in Table 16. To compute
$ss_{error_1(w)}$, one must first compute $ss_{\overline{ABS}}$, which is the sum of squares for between
A-B-S combinations. In the design diagrammed on page 297, there are
three such combinations for each subject in Group 1, and the total for each
combination is the sum of the two criterion measures for the two levels of C
within this combination. For example, the total for the combination $A_1B_1S_1$
is the sum of the criterion measures for $A_1B_1C_1S_1$ and $A_1B_1C_2S_1$.

The analysis of the total sum of squares is given in Table 16 below.
It will be noted that when a = 2, the degrees of freedom and sums of squares
involving AB(w) vanish, and that AB becomes entirely a “between”
effect.

TABLE 16
Analysis in Type VII Designs

| Source | df | Sums of Squares |
|---|---|---|
| Between-Subjects | $an - 1$ | $ss_S$ |
| AB(b) | $a - 1$ | $ss_{AB(b)} = ss_G$ |
| error (b) | $a(n - 1)$ | $ss_{error(b)} = ss_S - ss_{AB(b)}$ |
| Within-Subjects | $an(ac - 1)$ | $ss_{wS}$ |
| A | $a - 1$ | $ss_A$ |
| B | $a - 1$ | $ss_B$ |
| C | $c - 1$ | $ss_C$ |
| AB(w) | $(a - 1)(a - 2)$ | $ss_{AB(w)} = ss_{AB} - ss_{AB(b)}$ |
| AC | $(a - 1)(c - 1)$ | $ss_{AC}$ |
| BC | $(a - 1)(c - 1)$ | $ss_{BC}$ |
| AB(b)C | $(a - 1)(c - 1)$ | $ss_{AB(b)C} = ss_{GC}$ |
| AB(w)C | $(a - 1)(a - 2)(c - 1)$ | $ss_{AB(w)C} = ss_{ABC} - ss_{AB(b)C}$ |
| error (w) | $a(ac - 1)(n - 1)$ | $ss_{error(w)} = ss_{wS} - ss_{\overline{ABC}} + ss_{AB(b)}$ |
| error₁ (w) | $a(a - 1)(n - 1)$ | $ss_{error_1(w)} = ss_{\overline{ABS}} - ss_S - ss_A - ss_B - ss_{AB(w)}$ |
| error₂ (w) | $a(c - 1)(n - 1)$ | $ss_{error_2(w)} = ss_{\overline{CS}} - ss_S - ss_C - ss_{AB(b)C} = ss_{\overline{CS}} - ss_S - ss_{\overline{CG}} + ss_G$ |
| error₃ (w) | $a(a - 1)(c - 1)(n - 1)$ | $ss_{error_3(w)} = ss_{error(w)} - ss_{error_1(w)} - ss_{error_2(w)}$ |
| Total | $a^2cn - 1$ | $ss_T$ |

Tests of Significance: As in any random replications design, the appropriate
error term for any homogeneous effect is its interaction with R, but because
of the pooling of equal interactions, the various tests of significance are
as follows: A, B, and AB(w) against error₁ (w); AC, BC, and AB(w)C against
error₃ (w); C and AB(b)C against error₂ (w); AB(b) against error (b).
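
The pairing of effects with error terms, together with the degrees of freedom of Table 16, can be collected in a small lookup. The sketch below is illustrative only; the function names `type_vii_table` and `f_test_df` are hypothetical, and the pairing follows the pooled interactions with R given earlier in this section.

```python
def type_vii_table(a, c, n):
    """Return (effects, error_df) for a Type VII design.

    a: levels of A (and of B, since a = b); c: levels of C;
    n: subjects in each of the a groups.
    `effects` maps each effect to (df, name of its error term).
    """
    error_df = {
        "error(b)": a * (n - 1),
        "error1(w)": a * (a - 1) * (n - 1),
        "error2(w)": a * (c - 1) * (n - 1),
        "error3(w)": a * (a - 1) * (c - 1) * (n - 1),
    }
    effects = {
        "AB(b)": (a - 1, "error(b)"),
        "A": (a - 1, "error1(w)"),
        "B": (a - 1, "error1(w)"),
        "AB(w)": ((a - 1) * (a - 2), "error1(w)"),
        "C": (c - 1, "error2(w)"),
        "AB(b)C": ((a - 1) * (c - 1), "error2(w)"),
        "AC": ((a - 1) * (c - 1), "error3(w)"),
        "BC": ((a - 1) * (c - 1), "error3(w)"),
        "AB(w)C": ((a - 1) * (a - 2) * (c - 1), "error3(w)"),
    }
    return effects, error_df


def f_test_df(effect, a, c, n):
    """Degrees of freedom (numerator, denominator) for the F test of `effect`."""
    effects, error_df = type_vii_table(a, c, n)
    df_effect, error_name = effects[effect]
    return df_effect, error_df[error_name]


effects, error_df = type_vii_table(a=3, c=2, n=4)
total_df = sum(df for df, _ in effects.values()) + sum(error_df.values())
```

For a = 3, c = 2, and n = 4 the entries add up to the total of $a^2cn - 1$ degrees of freedom, and with a = 2 the entries involving AB(w) are zero.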

If a Bartlett test fails to reveal that the three “within” error terms are
heterogeneous, the pooled error term, $ms_{error(w)}$, may be used as a common
error term for all of the “within” effects. The error term for AB(b), which
is $ms_{error(b)}$, consists of the R and the AB(b)R effects, but these are properly
pooled into a single error term since it may be shown that their mean squares
cannot differ except by chance.

Simple effects of either A or B at a given level of C, or the simple AB inter- 
action at a given level of C, are tested against error; (w). 

Simple effects of C for a given level of A or B or for a given AB combination 
are tested against errors (w). 

The simple effects of A for a given level of B are tested against error (b) 
computed in the Type I design for that level of B only. The simple AC in- 
teraction for a given level of B is tested against error (w) computed in this 
Type I design. Simple effects of B and the simple BC interaction at a given 
level of A are similarly treated. 

The simple effects of A for a given BC combination are tested against
$ms_{wG}$ for the given BC combination only. Simple effects of B are similarly
treated.


Summary of Two- and Three-Factor Designs


It may be well at this point to summarize in table form the characterizing
features of the mixed designs that have thus far been considered. This has
been done in Table 17, page 302. From this table the experimenter can more
readily select the design appropriate to his peculiar experimental conditions.


Additional Designs 


Table 17 by no means presents all of the mixed designs that may be devised
for two- and three-factor experiments. For instance, in a three-factor
experiment in which a = b = c, it is possible to divide the subjects at random into
a groups, each group being given $a^2$ of the possible $a^3$ treatments. This design
is diagrammed at the bottom of page 302 for the cases in which a = 2 and
a = 3. In this design, when any one factor is disregarded, the design
reduces to a treatments X treatments X subjects design, so far as the other
two factors are concerned. Hence, all effects except ABC are “within”
effects, while ABC is mixed.




TABLE 17

Summary of Two- and Three-Factor Mixed Designs

| Type | Number of Factors | Special Condition | “Within” Effects | “Mixed” Effects | “Between” Effects |
|---|---|---|---|---|---|
| I (page 267) | 2 | | A, AB | | B |
| II (page 273) | 2 | a = b | A, B | AB* | |
| III (page 281) | 3 | | A, AB, AC, ABC | | B, C, BC |
| IV (page 285) | 3 | a = b | A, B, AC, BC | AB*, ABC* | C |
| V (page 288) | 3 | a = b = c | A, B, C | AB*, AC*, BC*, ABC** | |
| VI (page 292) | 3 | | A, B, AB, AC, BC, ABC | | C |
| VII (page 297) | 3 | a = b | A, B, C, AC, BC, ABC | AB* | |

* A “between” effect when a = 2.
** A “within” effect when a = 2.


The pseudo replications device must be employed to provide separate error
terms for the various treatment effects.

[Figure: the three-factor design with a = b = c, diagrammed for a = 2 and for a = 3.]


It may be noted also that Table 17 is concerned only with designs in which 
either all treatments in the same classification are given to the same subjects, 
or each is given to different subjects. For example, either all A comparisons 
are “within” comparisons or all are “between” comparisons. There are
other designs in which each subject may take some but not all of the treatments
in any given classification. Suppose, for example, that in a simple-
randomized design involving three treatments, it is possible to administer 
all of the treatments to the same subjects, but it is not desirable to do so, 
because doing so would place undue demands upon the subjects, or would 
be impracticable for other reasons. Suppose that, for some reason, it is prac- 
ticable to administer no more than two treatments to each subject. A design 
like that diagrammed below might then be employed. The subjects are 
divided at random into three groups of the same size, and each group is given 
two of the three treatments. Group 1 takes Treatments A; and A», Group 2 
takes Treatments A; and A;, and Group 3 Treatments A, and As. 


[Figure: the three groups arranged in two rows under Treatments $A_1$, $A_2$, and $A_3$.]


The treatments effect in either row of this design is obviously mixed and 
hence heterogeneous. In the upper row, two of the treatment-comparisons 
are between-subjects comparisons and one is a within-subjects comparison. 
For the table as a whole, however, the treatment-comparisons are homo- 
geneous, even though each is a mixed comparison, since each is half a within- 
subjects and half a between-subjects comparison. Accordingly, by means of 
the device of “pseudo” random replications, a valid error term can be found 
for testing the treatments effect. The test would be given by $F = ms_A/ms_{AR}$,
in which R represents the pseudo random replications. Whether or not this
design is more efficient than a simple-randomized design depends on the
correlation between the different criterion measures obtained for the same
subjects. If this correlation were quite low, the design could be less efficient
than a simple-randomized design, due to the loss of the degrees of freedom
which are associated with “between (pseudo) replications” and the “residual”
components which are not included in the error term. For example, if 9
subjects were used, there would be only 4 degrees of freedom available for
error out of the total of 17 - 5 = 12 degrees of freedom not associated with
the treatment effects.

This idea is carried further in the 3 X 2 X 2 design diagrammed on page 304, in
which the subjects are divided at random into 6 groups of the same size and
each group takes 6 of the 12 treatment-combinations. It is evident from this
table that all main effects are “within” effects, as are the AB and AC
interactions. It is evident also that the comparison of any two B-C
combinations consists in part of “between” and in part of “within” comparisons,
but that these are in the same proportion for each comparison of B-C
combinations. To secure valid error terms for the various effects, it is necessary
to employ the pseudo replications device. It should be evident from what
has just been said that each of the various interactions with R is a “mixed”
effect. In this design, as in the preceding, there is a considerable loss of
degrees of freedom from the error term.


[Figure: 3 X 2 X 2 design, with BC and ABC confounded.]


The designs that have here been presented as examples have not, to the 
writer’s knowledge, ever been employed in educational and psychological 
research, but they suggest still further designs that might have interesting 
possibilities in these fields. 


Partial Confounding 


We have noted that in two- and three-factor experiments it is possible to 
leave subject differences uncontrolled in evaluating a selected effect. That 
is, we may let any selected single effect (except the triple interaction in three- 
factor experiments) be a “between” or a “mixed” effect, all of the remain- 
ing effects being “within” effects. This is true, of course, granting that 
any possible combination of treatments is administrable to a single subject. 
This suggests that instead of leaving the same effect uncontrolled with all 
subjects, we might let one effect be uncontrolled with some subjects, another 
effect with others, and still another effect with still other subjects, etc. For 
example, in a 2 X 2 X 2 factorial experiment, one might divide the subjects 
at random into three sets of the same size, letting AB be a mixed or between 
effect in the first set, AC in the second, and BC in the third. The (within) 
mean square for each interaction would then be computed only for the data 
from those sets in which this interaction is a within-effect, the mean squares 
for main effects being computed from the data for all sets. Thus, in this 
particular example, no information about main effects would be sacrificed, 
while two-thirds of the information available on main effects would be avail- 
able on each interaction. This is an example of what is known as partial 
confounding. 
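
The bookkeeping in this example can be laid out in a few lines of Python. This is a sketch under the assumptions of the example only (three sets of equal size, one two-factor interaction left uncontrolled in each); the names `confounded_in_set` and `within_information` are hypothetical.

```python
from fractions import Fraction

# The interaction left uncontrolled (mixed or between) in each set of subjects.
confounded_in_set = {1: "AB", 2: "AC", 3: "BC"}
effects = ["A", "B", "C", "AB", "AC", "BC"]


def within_information(effect):
    """Fraction of the equal-sized sets in which `effect` is a within effect."""
    usable = [s for s, lost in confounded_in_set.items() if lost != effect]
    return Fraction(len(usable), len(confounded_in_set))


information = {effect: within_information(effect) for effect in effects}
```

Each main effect is estimated from all three sets, while each two-factor interaction draws on the two sets in which it is not confounded.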

The technique of partial confounding has been widely used in agriculture, 
but thus far the writer has found no reported instance of its use in the research 
literature of education and psychology. 


Mixed Higher-Order Experiments 


It should now be apparent that there is a very large number of possible
combinations of the basic two- and three-factor designs in factorial
experiments involving four or more factors. Any three-factor design, for example,
can be replicated independently for each level of a fourth factor, or for each
combination of a fourth and fifth factor. One of the many possible designs
for four-factor experiments is diagrammed below. The supplementary
diagrams suggest how that design might be viewed as a combination of more
basic designs. Another design is diagrammed on page 306. This design is
of particular interest as an example of an “incomplete” factorial design,
not all of the possible treatment-combinations being represented in the
design. This design may be regarded as a combination of the Type I and Type V
designs, each level of B being imposed on a different one of the “blocks”
represented on page 306. This design is useful on the assumption that the
factor D does not interact with any of the other factors or their interactions.
Because of the incomplete character of the design, the four-factor interaction
cannot be computed. However, the main effects and the two- and three-factor
interactions can be computed and tested. Several of the interactions are mixed
and must be analyzed into their homogeneous components. All “between”
effects involving D, including the main effect of D, can be pooled into a single
“between” error term, and all “within” interactions involving D may be
pooled into a single “within” error term. In experiments of this type, this
design can be used to control order and criterion effects, since if D does not
interact with A it may be safely assumed not to interact with any of the
other effects.


[Figure: $2^4$ design, four factors, AB and CD confounded. Layouts shown:
Disregarding D; Disregarding B; Disregarding C and D; Disregarding B and C;
Disregarding A and B.]




[Figure: $3^4$ design, four factors. Assumes no interactions with D. AB, AC, BC,
and ABC confounded. Rows are the B-D combinations and columns the A-C
combinations.]


STUDY EXERCISES¹


1. An experiment² was carried out to investigate the learning of a T-maze
by white rats under conditions of secondary reinforcement. Twenty female
hooded rats, under 22 hours of hunger, were given 10 trials per day in a straight 
runway for seven successive days. Seven of the trials each day were to a 
white goal box containing food, the other three to a black goal box without 
food. Under this procedure “white” presumably acquires secondary rein- 
forcing properties, i.e., “white” can serve in lieu of food as a reward for 
hungry rats. 

Upon completion of this preliminary training, the animals were shifted to 
a T-maze with a white goal box in one arm, a black goal box in the other, 
neither of which contained food. Ten of the original 20 animals were selected 
at random and given four trials per day in the T-maze for 20 successive days, 
each day under the same hunger condition as during training. The other 
10 rats were given a similar set of trials, but were satiated with food immedi- 
ately before each day’s trials. For both groups, the first two daily trials in 
the maze were free-choice, the next two forced in such a manner as to insure 
that each animal made the same number of right and left turns each day, thus 
precluding development of a direction preference. The criterion measure was 
defined as the number of white choices on the daily first trials in each five-day 
period. The apparatus is shown in the following diagram. 


1 See second paragraph on page viii. 
2 E. R. Dusek, Learning With Secondary Reinforcement Under Two Different Strengths 
of the Relevant Drive, M. A. Thesis, State University of Iowa, 1949. 




The social boxes each contained another rat and served to motivate the 
animals to move through the maze. These boxes were interchanged each 
day. To preclude the influence of inherent turning preferences, the white 
goal box was always in the left arm of the maze for a random half of each 
group of 10 rats, in the right arm for the other half. 
The data are as follows: 

[Table: number of white choices on the daily first trials for each animal, by trial category: T₁ (Days 1-5), T₂ (Days 6-10), T₃ (Days 11-15), T₄ (Days 16-20). Rows: the ten rats of the Hungry (H₁) group, then the ten rats of the Satiated (H₂) group.]




Computations Provided: 

                    Days 1-5     6-10    11-15    16-20 
      ΣX                25        37       38       27 
H₁    ΣX²          [illegible]   155      158      105 
      (ΣX)²/10         62.5     136.9    144.4     72.9 
      ΣX                17        21       27       22 
H₂    ΣX²               41        63       87       60 
      (ΣX)²/10         28.9      44.1     72.9     48.4 

a) Complete the analysis and prepare a summary table. 

b) May the hypothesis of no HT interaction be rejected at the 5% level? 
Calculate the TS interaction effects for the first three observations 
in the H₁T₁ cell of the table. The test of the HT interaction involves 
what assumption concerning such interaction effects? 

c) What are the numerical values of the over-all T₁, T₂, etc., means? 
Test (at the 5% level) the hypothesis of no differences among the 
corresponding population means. 

d) Define the T₁ treatment population. Need one specify anything about 
the proportion of animals in this population subject to the H₁ and H₂ 
conditions? Explain. 

e) Explain why one may not conclude from the test of (c) that the animals 
had "learned" to choose the white arm of the maze. Carry out a test 
that will permit a tentative conclusion about this matter. 

f) Test the simple trial effects for each hunger condition (recognizing that 
this test would ordinarily not be made when the interaction is non- 
significant). May one infer from these results that “learning” took 
place in one group and not in the other, or that there is a real T-effect 
at one level of H, but not at the other? Explain why there is no real 
inconsistency between these results and the results obtained in (b). 

g) Test (at the 5% level) the hypothesis of equal over-all population means 
for the H₁ and H₂ conditions. Define the populations involved and 
specify the distributions to which the assumptions of homogeneity and 
normality apply. 

h) Test the simple effect of H at the T₂ level, using the data for this level 
only. Explain why the same error term is not valid for testing both the 
simple and the main effects of H. 

i) Compute the TS mean square for H₁ alone. For H₂ alone. What is the 
relation of these two mean squares to ms_error(w) computed for the entire 
experiment? 



j) Which of the main effects in this experiment is more precisely evaluated? 
Explain. 

k) What principle of experimental design was violated in this experiment 
in connection with counterbalancing the social boxes and position of 
goal box? 


2. One purpose of a study carried out by Heyman ¹ was to investigate 
transposition behavior in white rats trained to discriminate between visual stimuli 
differing in “brightness.” Rats were trained in a T-maze to choose the arm 
associated with the brighter of a stimulus pair, and were then tested on other 
stimulus pairs whose brightness ratio remained constant but whose absolute 
brightness values were in all cases higher than those used in training. The 
apparatus was a simple elevated T-maze with a glass surface over the stimulus 
area as shown in the following sketch. 


The stimulus conditions consisted of different brightness levels produced by 
the reflection of the overhead light from paper strips of various shades of 
grey placed on the alley floor. The rat was required to choose the brighter 
arm in order to obtain food. The stimulus conditions used were as follows: 


                        Brightness of Arms in 
Conditions              Apparent Foot-Candles 
Training                  .33 and  .98 
Transposition (A₁)        .69 and 2.05 
Transposition (A₂)        .98 and 3.11 
Transposition (A₃)       2.05 and 6.11 


Fifteen female hooded rats were selected, all of which had previously been 
used in a bar-pressing experiment. Preliminary procedure consisted of 
exploration, adjustment to a feeding schedule, and four reinforced trials to 
each arm of the maze with no stimuli in place. The training procedure con- 
sisted of 20 trials per day with the training stimuli in place. Responses 


1 M. N. Heyman, An Investigation of Hypothetical Generalization Gradients in a 
Visual Transposition Situation, M. A. Thesis, State University of Iowa, 1951. 




to the brighter side (correct response) were rewarded with a pellet of food 
at the end of the arm. The brighter stimulus appeared equally often in each 
arm of the maze. The training was continued until an animal made 18 correct 
choices out of 20 trials with the last 10 correct. 

Upon reaching the training criterion, the animal was given 10 rewarded 
trials on one of the three transposition conditions (A₁, A₂, and A₃). After 
this series of trials was completed, the animal was returned to the original 
training condition until it ran 10 consecutive correct trials. The animal 
was then given 10 trials on the second transposition condition, then more 
training, followed by 10 trials on the third condition. The order (but not 
sequence) of the administration of the transposition conditions was counter- 
balanced by dividing the animals at random into three equal groups, and 
assigning one to each of the orders A₁A₂A₃, A₃A₁A₂, and A₂A₃A₁. In other 
words, a Type II design was employed, with A (Transposition conditions) 
and O (Order) as the experimental factors. 

The table below presents the number (X) of “bright” responses in 10 trials 
on each transposition condition for each animal. 

                 Group 1               Group 2               Group 3 
            A₁O₁  A₂O₂  A₃O₃      A₃O₁  A₁O₂  A₂O₃      A₂O₁  A₃O₂  A₁O₃ 
              9     8    10         5     7     9         9     7    10 
             10     9     7         4    10     6         9     5    10 
             10     8     6         5     8     9         9     7    10 
              7    10     6         7    10     9         8     6     9 
             10     8     5         8    10     5         5     5     6 
ΣX           46    43    34        29    45    38        40    30    45 
ΣX/5        9.2   8.6   6.8       5.8   9.0   7.6       8.0   6.0   9.0 
ΣX²         430   373   246       179   413   304       332   184   417 
(ΣX)²/5   423.2 369.8 231.2     168.2 405.0 288.8     320.0 180.0 405.0 
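
The summary rows of this table can be checked mechanically before the analysis is begun. The following is only a sketch in Python, with the scores entered as they are read here (one list per column); the dictionary keys and the helper name `column_summary` are illustrative and are not part of the exercise.

```python
# Scores as read from the table above: one list per column (five animals each).
scores = {
    "A1O1": [9, 10, 10, 7, 10],
    "A2O2": [8, 9, 8, 10, 8],
    "A3O3": [10, 7, 6, 6, 5],
    "A3O1": [5, 4, 5, 7, 8],
    "A1O2": [7, 10, 8, 10, 10],
    "A2O3": [9, 6, 9, 9, 5],
    "A2O1": [9, 9, 9, 8, 5],
    "A3O2": [7, 5, 7, 6, 5],
    "A1O3": [10, 10, 10, 9, 6],
}


def column_summary(values):
    """Return (sum, mean, sum of squares, squared sum divided by n)."""
    total = sum(values)
    n = len(values)
    return total, total / n, sum(v * v for v in values), total ** 2 / n


summary = {cell: column_summary(values) for cell, values in scores.items()}
for cell, (total, mean, sum_sq, sq_sum_over_n) in summary.items():
    print(f"{cell}: sum={total} mean={mean:.1f} "
          f"sum_sq={sum_sq} (sum)^2/5={sq_sum_over_n:.1f}")
```

Each printed line should agree with the corresponding column of the four summary rows.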


a) Analyze the results and prepare a summary table. 


b) Test (at the 1% level) the hypothesis of no AO interaction. Define the 
population involved in this test. 


c) In Type II experiments in general, if one component of the interaction 
is found significant and the other is not, what are the two possible 
interpretations of this finding? Which interpretation would you usually 
favor in experiments of this kind? Explain. 


d) Suppose that the five animals in each group had been kept in the same 
living cage during the course of this experiment, and that extraneous 
factors affecting the cages systematically had produced differences in 




their criterion means. To what "effect" in the analysis would these 
cage differences contribute? Is it possible, then, that the “within” 
component of AO is zero in the population, while the “between” com- 
ponent is real (other than zero)? Explain. 


e) Had the “within,” rather than the “between,” component of AO been 
found significant, could one likewise plausibly contend that the “within” 
component is real while the “between” component is zero in the popu- 
lation? Explain. (See page 274.) 

f) How do the assumptions underlying the test of one component of the 
AO interaction differ from those for the other component? If there 
were an extreme violation of the assumption of homogeneity of variance 
underlying AO(b), how would it show up in the preceding table? 


g) What “corrections” must be applied to the data (see page 280) in order 
that the corrected data may be regarded as corresponding to a simple- 
randomized design involving only the order effect? How do you know 
that the ratio of the mean squares for between-orders and for within- 
orders computed for the corrected data must be equal to the mean square 
ratio used in the preceding test? 


h) Test (at the 1% level) the main effect of the transposition conditions (A). 
Do the assumptions underlying this test differ from those underlying 
the test of the “within” component of AO? Explain. 


i) What advantage is gained in this experiment by counterbalancing the 
order effects as opposed to randomizing them? Justify your answer. 


j) What error term would you use to test the simple effect of A for O₁? 
Why not pool such error terms for the three levels of O to obtain a more 
stable term? 


k) If you had occasion to test the significance of the difference between 
the means for A₁O₁ and A₂O₂, what error term would you employ? 
Explain. 

l) The mean square for O in this experiment is "significantly" smaller 
than that for error (w), and at a very high level of significance. In 
general, granting that the experiment was well designed and properly 
administered, how does one account for a finding of this kind? 


3. An experiment was concerned with the effects of massed and distributed 
practice on the learning of successive lists of adjectives. Each subject learned 
a different list of 14 two-syllable adjectives each day for four successive 
days. The words were presented by memory drum at a two-second ex- 
posure rate. Practice was continued each day until the subject correctly 
anticipated all words in the list on a single trial. The learning score for 
each subject each day was the number of trials needed to reach this criterion. 




The purpose of the experiment was to determine if the time required to 
learn a list decreased more rapidly under one distribution of practice than 
under the other. 

Forty-eight college students were assigned at random to one of two groups 
of 24 each. The subjects of one group learned a list each day under con- 
ditions of massed practice (P₁), with a two-second interval between trials. 
The other group learned each list under distributed practice (P₂), with a 
30-second interval between trials. Each of these groups was randomly sub- 
divided into four subgroups of six subjects each, and one subgroup was 
assigned to each of four orders in which the lists were learned. In other 
words, “lists” were counterbalanced with respect to “days” in each group.¹ 

The sum and the mean of the criterion scores for each LDP combination 
are given in the following table. The total sum of squares is 15,788 and 
that for between-subjects is 6,884. 


        Massed Practice (P₁)                   Distributed Practice (P₂) 
            List  Day  Sum  Mean                   List  Day  Sum  Mean 
Subgroup 1   I     1   198   33     Subgroup 5      I     1   150   25 
             II    4   120   20                     II    4    84   14 
             III   2   156   26                     III   2   108   18 
             IV    3   150   25                     IV    3   126   21 
Subgroup 2   I     2   144   24     Subgroup 6      I     2   120   20 
             II    3   144   24                     II    3   156   26 
             III   1   186   31                     III   1   138   23 
             IV    4   114   19                     IV    4   150   25 
Subgroup 3   I     3   120   20     Subgroup 7      I     3    96   16 
             II    2   120   20                     II    2    84   14 
             III   4   126   21                     III   4    90   15 
             IV    1   186   31                     IV    1   150   25 
Subgroup 4   I     4    90   15     Subgroup 8      I     4   114   19 
             II    1   126   21                     II    1    90   15 
             III   3   114   19                     III   3    66   11 
             IV    2   126   21                     IV    2    90   15 


a) Complete the analysis and prepare a summary table. 


b) May the hypothesis of no LD interaction be rejected at the 5% level? 


1 The data in this exercise are hypothetical, but the situation is much like that re- 
ported by Benton J. Underwood in Journal of Experimental Psychology, vol. 42 
(November, 1951), pp. 291-295. 




c) What characteristic of the lists might cause List I to be more difficult 
to learn on the first day (when preceded by no other lists) than on a 
later day (when immediately preceded by List III)? Would this differ- 
ence affect one or both components of the LD interaction? Explain in 
terms of the analysis of the 4 X 4 square on page 274. 


d) Suppose that there is an intrinsic interaction (of the type just suggested) 
in this experiment, but no extrinsic interaction. How, in that case, 
could you explain why one component of the LD interaction is signifi- 
cant, while the other is not? 


e) Can you suggest any specific extraneous factors or circumstances that 
would increase (or decrease) ms_LD(b) without affecting ms_LD(w)? 


f) Can you find any plausible way of explaining an intrinsic LD(b) interac- 
tion and a zero LD(w) interaction in an experiment of this kind? 


g) Which of the preceding interpretations (d, e, and f) of the results of (b) 
do you prefer? Why? 


h) Test (at the 5% level) the null hypotheses regarding the main effects of 
P, D, and L. Does it appear important to have “taken out” List differ- 
ences? Could this have been known in advance of the experiment? 


i) Specify the assumptions underlying each of the tests in (h). 


j) Is there any fundamental interest in this experiment in the mixed inter- 
actions for their own sake? In any of the List effects? 


k) If all effects involving Lists were ignored, to what type of design would 
this reduce? What would be the numerical values of ms_error(b) and 
ms_error(w), if the data were so analyzed? Would the interpretation of the 
P, D, and PD effects have differed? 


l) What is the “order” effect in this experiment? Why is it not isolated 
as a factor apart from treatments? 


4. An experiment carried out by Kalish ¹ was designed to investigate the 
effects of varying numbers of acquisition and extinction trials on the strength 
of anxiety in white rats. The apparatus used for the acquisition and ex- 
tinction trials is shown in the sketch on page 314. Four identical boxes were 


1 H. I. Kalish, Strength of Fear as a Function of the Numbers of Acquisition and 
Extinction Trials. Ph.D. Thesis, State University of Iowa, 1952. 




used, all operated by the same controls so that four animals could be run 
simultaneously. 
In each acquisition trial the animal was placed in the box and the buzzer 
was sounded for a five-second interval, during the last second of which the 
animal was given an electric shock through the grid in the floor of the box. 
Each extinction trial was identical with the acquisition trial, except that 
the electric shock was omitted. 

A random fourth of the experimental animals were given 1 acquisition trial, 
another fourth 3, another fourth 9, and the last fourth 27 acquisition trials. 
Within each acquisition category a random fourth of the animals received 
0 extinction trials, another fourth 3, another fourth 9, and the last fourth 
27 extinction trials. There were thus 16 different treatment-combinations, 
each administered to a different random group of animals. 

The entire experiment consisted of eight random replications. In each 
replication 16 animals selected at random from the available colony were 
identified by tail markings and placed two in each of eight living cages. 
The animals were then randomly assigned one to each of 16 treatment- 
combinations. On the first day of each replication, four animals assigned to 
the same acquisition category were taken from the living cages and placed 
one in each of the four boxes for a period of 81 minutes. The four animals 
then received the specified number of acquisition trials. The trials were 
spaced at three-minute intervals and were so arranged that the last trial 
(or the only trial in the case of one-trial acquisition) started at the end of the 
78th minute in the box. Immediately following the 81-minute period, the four 
animals were returned to the living cages and the process repeated successively 
with the three remaining groups of four rats (or for the three remaining levels 
of acquisition). Six hours were required to complete the running of the 
acquisition trials for the 16 animals in one replication. 

On the following day, the four groups of rats assigned to the four extinction 
categories were successively placed in the boxes for 81 minutes and given 
the appropriate number of extinction trials. Immediately following the 
extinction trials, each animal was placed in a box identical to the conditioning 
boxes, except for a 3 guillotine door at the end opposite the buzzer. After 
15 seconds the buzzer was sounded and the door raised, presenting a two- 
inch hurdle over which the rat could escape to an empty box. Each animal 
was given 12 such hurdle-jumping trials at five-minute intervals, and a latency 
measure was recorded for each trial. For purposes of analysis, the 12 trials 
were treated as four blocks of three trials each. The criterion measure for 




each animal for each block was 2 plus the logarithm of the mean latency 
for the three trials in the block. 

This basic experiment was repeated eight times over a period of six weeks, 
each time with a separate independent random sample of 16 rats from the 
same colony. Food and water were available at all times in the living cages 
during the course of each replication. In each replication the order in which 
the acquisition and extinction groups were run was randomly determined. 
All replications were handled by the same experimenter, and care was taken 
to see that identical procedures were followed in each replication. Each of 
the replications thus employed a Type III design with T (trials) and all in- 
teractions with T as the “within” effects, and with the remaining effects, 
A (acquisition conditions), E (extinction conditions), and AE, as the 
"between" effects. The entire design is thus a mixed A X E X T X R design. 

The means (over-all replications) of the criterion measures for the four 
blocks of trials are given for each treatment-combination in the table to the 
left below. The upper right-hand table presents the mean of the criterion 
measures (over-all replications and extinction conditions) for each acquisi- 
tion condition, and the lower right-hand table presents the mean of the 
criterion measures (over-all replications and acquisition conditions) for each 
extinction condition. 

[Tables: Mean Criterion Scores for Trials 1-3, 4-6, 7-9, and 10-12. Left: one row for each of the 16 AE treatment-combinations. Upper right: one row for each acquisition condition. Lower right: one row for each extinction condition.]




The following sums of squares are provided: 


ssa = 13.8518 ssas = 33.1120 Ssigg = 73.8277 
ssg = 13.3390 ssqp = 23.1999 ssagy = 46.0231 
ssp = 8.2200 ssar = 23.1194 ssar = 36.7313 
ssr = 3.8645 ssgy = 23.5163 ssppg = 43.3426 
ssgg = 29.2573 ssagpn = 101.8745 


ss7p = 12.8225 

a) Prepare a summary table of the complete analysis, classifying the 
various main effects and interactions as “between” and “within” effects, 
as has been done for each design in this chapter. 

b) Identify one of the totals involved in computing s$3z75- 

c) Apply tests of homogeneity to the “between” interactions with R, .. . 
to the “within” interactions with R. How do you account for the 
difference in the outcomes of these tests? 

d) In which case may the interaction with R be pooled to provide a more 
stable error term? On what assumption? Is any important advantage 
gained by such pooling in this case? Explain. 

e) Estimate the error variance of the difference between the means (over- 
all other factors) of two “acquisition” categories. Estimate the error 
variance of the difference between the (over-all) means of two “trial” 
categories. How many times as many cases would be needed to make 
the first of these error variances in an augmented experiment as small 
as the second is in this experiment? 

f) Test (at the 1% level) the hypothesis of no AT interaction in the popu- 
lation. Specify the assumptions underlying this test. 

g) May one infer from the test of (f) that there is an intrinsic AT inter- 
action? Explain. 

h) It appears that there is probably an AE interaction in the population, 
since the observed AE interaction is significant at the 5% level. What 
inferences may one draw concerning the rank order of the acquisition 
categories from one extinction level to another?... about the rank 
order of the extinction categories from one acquisition level to another? 


i) Test the simple effect of A for Ej. For Ei. 
j) Suggest some extraneous factors that might differ systematically from 
replication to replication. Under what circumstances, in general, can 


such factors affect the treatment comparisons? Why do these sys- 
tematic factors seem of little consequence in this experiment? 


k) How are factors affecting cages systematically in each replication handled 
in this experiment? Do such factors give rise to extraneous interaction 
in this experiment? Could they "account for" a significant interaction? 
Explain. 


Analysis of Covariance 


Nature and Purposes of Analysis of Covariance 


Many of the designs thus far considered have involved the experimental 
control of a concomitant variable or variables while observations are being 
made of the effect of a given experimental variable. In the treatments X 
levels design, for example, the effect of a concomitant variable is controlled 
experimentally by so selecting the treatment groups that each shows the same 
distribution by levels of that variable. In the simple factorial (A X B) de- 
sign, the effect of the B factor is similarly controlled in comparisons of the 
A categories, since within each A category the distribution by B categories 
is the same. There are some experimental situations, however, in which it is 
either impossible or impracticable thus to control a concomitant variable by 
direct selection of the subjects. Sometimes administrative conditions make 
direct control impracticable. For example, a school principal may permit an 
educational experimenter to administer different experimental methods of 
instruction to different school classes, but may require that he use the classes 
as they are already organized. That is, he may not allow the experimenter to 
reorganize the classes into “matched” classes for the purposes of the experi- 
ment, since such reassignments would introduce conflicts in other aspects of 
the students' class schedule. Sometimes, too, the variations in the concomi- 
tant variable do not arise or are not observable until after the experiment 
has begun. For example, in a methods of instruction experiment the con- 
comitant variable may be the number of hours spent in study by individual 
students — a variable which it may be practicable to observe and measure 
but not to control experimentally. 

In situations in which experimental control of a concomitant variable may 
be either impossible or impracticable, it is sometimes possible to resort to 
statistical control of that variable. That is, observations may be made of 
the uncontrolled concomitant variations and appropriate adjustments may be 
made in the criterion means for the various treatment groups, as well as in 
the error term used in the test of significance. This method of statistical 
control is that known as the method of analysis of covariance. 



Suppose that in a simple randomized experiment with just two treatments, 
measures are available not only for the criterion variable (Y) but also for a 
concomitant (and uncontrolled) variable (X). For each treatment group, one 
could then compute the correlation ($r_{xy}$) between these variables, and the 
coefficient of regression of Y on X ($b_{yx}$). If the treatments are alike in their 
effects on the criterion variable (which is the hypothesis to be tested) then, 
except for chance, these coefficients are the same for both treatment groups. 
We will call the estimate of the common true regression coefficient the within- 
groups regression coefficient (b), and of the common true correlation coeffi- 
cient the within-groups correlation. By means of the regression equation 
we can then compute for each subject the “expected deviation" of his Y- 
measure from the general Y-mean — that is, the deviation expected in con- 
sideration of the deviation (x) of his X-measure from the X-mean. 

Suppose that we then subtract from each subject’s actual Y-measure the 
expected deviation (bx) of that measure from the general Y-mean, and that 
we call the result (Y-bx) his adjusted criterion measure. The adjusted criterion 
measure for any individual subject would then be independent of his X- 
measure. A given subject may have a high X-measure, and his “expected” 
Y-measure may therefore also be high, but his actual Y-measure may be 
either above or below expectation. Similarly, the mean X-score, and hence 
the expected mean Y-score, may be higher for Group 1 than for Group 2, 
but yet the mean of the adjusted scores for Group 1 may be either higher 
or lower than that for Group 2. If we then find that the mean of the adjusted 
criterion scores is higher for the group that received Treatment 1 than for 
the group that received Treatment 2, we would know that the difference is 
not due to any difference in the X-factor for the two groups, but must be 
due either to chance fluctuations in random sampling or to the effect of the 
treatments (or possibly to the effect of some uncontrolled extraneous variable). 

Suppose we thus compute the adjusted criterion score for each subject in 
both treatment groups, and that we then apply the usual methods of analysis 
of variance, not to the actual criterion measures, but to the adjusted criterion 
measures, and that we finally apply an F-test to test the significance of the 
treatment difference. If the difference in adjusted criterion means is found 
to be significant, we can then be reasonably certain that it is due, not to 
sampling fluctuations, but to a real treatment effect (or, if the experiment 
has not been closely controlled, possibly to the effect of some uncontrolled 
extraneous variable). Thus, through a purely statistical control we can 
secure the same precision in the evaluation of the treatment effect as if we 
had experimentally controlled the X-factor by actually matching the groups 
with reference to X, or by constituting the subjects into levels on the basis 
of X and using the appropriate test in a treatments X levels design. 
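
The adjustment described above can be sketched numerically. The following Python fragment is only an illustration under stated assumptions: the (X, Y) pairs, the group names, and all variable names are hypothetical and do not come from the text. It computes the within-groups regression coefficient, forms Y − bx for every subject, and compares the adjusted group means.

```python
# Hypothetical (X, Y) pairs for two treatment groups; not data from the text.
groups = {
    "treatment_1": [(10, 14), (12, 15), (14, 19), (16, 20)],
    "treatment_2": [(6, 8), (8, 11), (10, 12), (12, 15)],
}


def mean(values):
    return sum(values) / len(values)


all_x = [x for pairs in groups.values() for x, _ in pairs]
general_mean_x = mean(all_x)

# Within-groups regression coefficient: pooled sum of products divided by the
# pooled sum of squares of X, both taken about each group's own means.
sp_within = 0.0
ss_within_x = 0.0
for pairs in groups.values():
    mx = mean([x for x, _ in pairs])
    my = mean([y for _, y in pairs])
    sp_within += sum((x - mx) * (y - my) for x, y in pairs)
    ss_within_x += sum((x - mx) ** 2 for x, _ in pairs)
b = sp_within / ss_within_x

# Adjusted criterion measure: Y - b*x, where x = X - general X-mean.
adjusted = {
    name: [y - b * (x - general_mean_x) for x, y in pairs]
    for name, pairs in groups.items()
}
raw_means = {name: mean([y for _, y in pairs]) for name, pairs in groups.items()}
adjusted_means = {name: mean(values) for name, values in adjusted.items()}

print("b =", round(b, 4))
print("raw Y means:", raw_means)
print("adjusted Y means:", adjusted_means)
```

In this made-up example the group with the higher X-mean also has the higher raw Y-mean, and most of that difference disappears once the scores are adjusted; whether what remains is significant is a matter for the F-test described in the text.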

The procedure just described is essentially that of the method of analysis 
of covariance. The method is simply an extension of the method of analysis 
of variance, applied to the adjusted rather than to the actual criterion meas- 
ures. Why it is called the method of analysis of covariance will be made 




clear in the following section, concerned with the basic formulas and com- 
putational procedures needed in the application of the method. 


Basic Formulas 


The covariance of two variables for a given population is defined as the 
mean of the products of the deviations of the paired variables from their 
respective population means. The covariance for a sample is the mean product 
of the deviations from the sample means. The covariance of X and Y for a 
given sample is thus $\sum xy/N$, in which $x = X - M_X$ and $y = Y - M_Y$ are the 
deviations of X and Y from their respective sample means. 

The best ¹ estimate of the covariance of the population that may be obtained 
from a random sample of N cases is $N/(N-1)$ times the covariance of the sample, 
just as the best estimate of the population variance is $N/(N-1)$ times the sample 
variance. That is, the best estimate of the covariance of the population is 
$\sum xy/(N-1)$. (The proof of this, which is closely similar to the proof 
that the best estimate of the population variance is $\sum x^2/(N-1)$, is left as 
an exercise for the student.) Thus the degrees of freedom for the variance of 
X or Y and for their covariance are the same. 
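
A small numerical illustration of the two quantities just distinguished may help. The data and variable names below are hypothetical and are chosen only so that the arithmetic is easy to follow by hand.

```python
# Hypothetical paired measures; not data from the text.
X = [2, 4, 6, 8]
Y = [1, 3, 2, 6]
N = len(X)

mean_x = sum(X) / N
mean_y = sum(Y) / N

# Sum of products of the deviations from the sample means.
sum_xy = sum((xv - mean_x) * (yv - mean_y) for xv, yv in zip(X, Y))

sample_covariance = sum_xy / N                      # covariance of the sample
estimated_population_covariance = sum_xy / (N - 1)  # N/(N-1) times the above

print(sum_xy, sample_covariance, estimated_population_covariance)
```

Here the sum of products is 14, so the sample covariance is 14/4 and the estimate of the population covariance is 14/3.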

We shall let sp represent the “sum of products" ($sp = \sum xy$), just as we 
have previously let ss represent the sum of squares. In this case, $ss_X = \sum x^2$ 
and $ss_Y = \sum y^2$. 

If the total sample consists of a groups, the “total” sum of products may 
be analyzed into its between-groups and within-groups components, just as 
may the sum of squares of either X or Y. The proof is as follows: 

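Let $\bar{x}_i$ represent the mean of the x's for group i, and let $\bar{y}_i$ represent the 
mean of the y's. Then, for any subject in group i, 

\[ (x - \bar{x}_i)(y - \bar{y}_i) = xy - x\bar{y}_i - y\bar{x}_i + \bar{x}_i\bar{y}_i. \]

Summing such expressions for all $n_i$ subjects in group i, we get 

\[ \sum (x - \bar{x}_i)(y - \bar{y}_i) = \sum xy - \bar{y}_i\sum x - \bar{x}_i\sum y + n_i\bar{x}_i\bar{y}_i. \]

Summing these expressions for all a groups, and simplifying, we get 

\[ \sum_{i=1}^{a}\sum (x - \bar{x}_i)(y - \bar{y}_i) = \sum_{i=1}^{a}\sum xy - \sum_{i=1}^{a} n_i\bar{x}_i\bar{y}_i, \]

from which 

\[ \sum_{i=1}^{a}\sum xy = \sum_{i=1}^{a}\sum (x - \bar{x}_i)(y - \bar{y}_i) + \sum_{i=1}^{a} n_i\bar{x}_i\bar{y}_i. \tag{108} \]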

Thus, we see that the "total" sum of products ($sp_T = \sum\sum xy$) of the 


1 According to the least squares criterion. 




deviations of the X and Y measures from their respective general means is 
composed of two components. One of these (the first right-hand term) is the 
sum of the sums of the products for the individual groups of the deviations of 
the measures from their respective group means. This we will refer to as the 
within-groups sum of products ($sp_w$). The other ($\sum n_i\bar{x}_i\bar{y}_i$) is the sum of 
the weighted products of the deviations of the group means from their re- 
spective general means. This is the sum of products for between-groups ($sp_A$). 

Formula (108) may then be rewritten 

\[ sp_T = sp_w + sp_A. \]
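
The identity in formula (108) can be verified directly from deviations. The sketch below uses two hypothetical groups of unequal size; the data and names are illustrative assumptions, not material from the text.

```python
# Hypothetical (X, Y) pairs for two groups of unequal size.
groups = [
    [(1, 2), (2, 4), (3, 3)],
    [(4, 6), (6, 7), (5, 8), (9, 11)],
]


def mean(values):
    return sum(values) / len(values)


all_pairs = [pair for group in groups for pair in group]
general_mean_x = mean([x for x, _ in all_pairs])
general_mean_y = mean([y for _, y in all_pairs])

# "Total" sum of products: deviations taken from the general means.
sp_total = sum(
    (x - general_mean_x) * (y - general_mean_y) for x, y in all_pairs
)

sp_within = 0.0
sp_between = 0.0
for group in groups:
    mx = mean([x for x, _ in group])
    my = mean([y for _, y in group])
    # Deviations of the measures from their own group means.
    sp_within += sum((x - mx) * (y - my) for x, y in group)
    # Weighted product of the deviations of the group means.
    sp_between += len(group) * (mx - general_mean_x) * (my - general_mean_y)

print(sp_total, sp_within, sp_between)
```

The first printed value should equal the sum of the other two, apart from rounding error in the floating-point arithmetic.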

The formulas for computing the sums of products from the original meas- 
ures (X and Y) are closely similar to the formulas for computing the cor- 
responding sums of squares. The “total” sum of products is 

\[ sp_T = \sum_{i=1}^{a}\sum XY - \frac{T_X T_Y}{N}, \tag{109} \]

in which the summation of the XY products is over all groups, and in which 
$T_X = \sum_{i=1}^{a}\sum X$, $T_Y = \sum_{i=1}^{a}\sum Y$, and N represents the total number of subjects 
in all groups. 

Similarly, the sum of products for between-groups (A) is 

\[ sp_A = \sum_{i=1}^{a} n_i\bar{x}_i\bar{y}_i = \sum_{i=1}^{a}\frac{T_{X_i}T_{Y_i}}{n_i} - \frac{T_X T_Y}{N}, \tag{110} \]

in which $T_{X_i}$ represents the sum of the X’s for group i, and $n_i$ represents the 
number in the group, etc. [The derivation of formulas (109) and (110) from 
the basic expressions in formula (108) is left as an exercise for the student.] 

The sum of products for within-groups is then obtained as a residual by 

\[ sp_w = sp_T - sp_A. \tag{111} \]

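
The computing formulas (109) to (111) work from raw totals rather than from deviations. A minimal sketch, using hypothetical data and illustrative variable names, is:

```python
# Hypothetical (X, Y) pairs for two groups of unequal size.
groups = [
    [(1, 2), (2, 4), (3, 3)],
    [(4, 6), (6, 7), (5, 8), (9, 11)],
]

N = sum(len(group) for group in groups)
T_X = sum(x for group in groups for x, _ in group)
T_Y = sum(y for group in groups for _, y in group)
correction = T_X * T_Y / N

# Formula (109): sum of the raw XY products less the correction term.
sp_total = sum(x * y for group in groups for x, y in group) - correction

# Formula (110): sum over groups of T_Xi * T_Yi / n_i, less the correction.
sp_between = sum(
    sum(x for x, _ in group) * sum(y for _, y in group) / len(group)
    for group in groups
) - correction

# Formula (111): the within-groups sum of products as a residual.
sp_within = sp_total - sp_between

print(sp_total, sp_between, sp_within)
```

For these made-up data the within-groups value obtained as a residual agrees with the value found directly from deviations about the group means (1 for the first group and 13 for the second).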
It will be recalled that in the more complex experimental designs the 
analysis of the total sum of squares into a number of components consists 
essentially of successive applications of the method of analysis into two 
components. The same is true with reference to the analysis of the total 
sum of products. In any experimental design for which measures of two 
variables are available for each subject, the analysis of the total sum of 
products follows exactly the analysis of the total sum of squares for either 
variable. Thus, in the simple (A X B) factorial design, the sum of products 
for interaction (AB) is equal to the sum of products for between-combinations 
(between cells) less the sums of products for A and B, just as the sum of squares 
for interaction is equal to the sum of squares for between-combinations 
minus the sums of squares for the main effects. 
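
The parallel between the two analyses can be illustrated for a balanced A X B design. The sketch below is only an assumption-laden example: the cell data, the level labels, and the helper names `sp_between` and `pool_by_factor` are all hypothetical.

```python
# Hypothetical (X, Y) pairs for a balanced 2 x 2 design, two subjects per cell.
cells = {
    ("a1", "b1"): [(3, 5), (5, 6)],
    ("a1", "b2"): [(4, 9), (6, 10)],
    ("a2", "b1"): [(7, 8), (9, 11)],
    ("a2", "b2"): [(8, 15), (10, 16)],
}

all_pairs = [pair for pairs in cells.values() for pair in pairs]
N = len(all_pairs)
correction = sum(x for x, _ in all_pairs) * sum(y for _, y in all_pairs) / N


def sp_between(groups):
    """Sum of T_Xi * T_Yi / n_i over the groups, less the correction term."""
    return sum(
        sum(x for x, _ in g) * sum(y for _, y in g) / len(g) for g in groups
    ) - correction


def pool_by_factor(index):
    """Pool the cells that share a level of factor A (index 0) or B (index 1)."""
    pooled = {}
    for key, pairs in cells.items():
        pooled.setdefault(key[index], []).extend(pairs)
    return list(pooled.values())


sp_total = sum(x * y for x, y in all_pairs) - correction
sp_cells = sp_between(list(cells.values()))   # between-combinations
sp_A = sp_between(pool_by_factor(0))
sp_B = sp_between(pool_by_factor(1))
sp_AB = sp_cells - sp_A - sp_B                # interaction, as a residual
sp_within = sp_total - sp_cells               # within-cells, as a residual

print(sp_A, sp_B, sp_AB, sp_within, sp_total)
```

Replacing each product x * y by x * x (or y * y) in the same code would give the corresponding analysis of the sum of squares, which is the sense in which the two analyses follow the same pattern.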
We may next recall, from simple correlation theory, that the coefficient of 
correlation between X and Y for a sample may be written 


pe pa e cem: 
NOSOqie y ET SST e 88r, 


BASIC FORMULAS 321 


in which ssr, and ssr, are the “total” sums of squares for the X and Y dis- 
tributions. Since, when a sample consists of a number of groups, Dy, 2", 
Dy? may each be analyzed into its between-groups and within-groups com- 
ponents, we may obtain three different correlation coefficients from the sample — 
one for total, based on the "total" sums of squares and products, one for 
between-groups, and one for within-groups, as follows; 


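$$r_{xy(T)} = \frac{sp_T}{\sqrt{ss_{T_x}\, ss_{T_y}}}, \qquad (112)$$

$$r_{xy(A)} = \frac{sp_A}{\sqrt{ss_{A_x}\, ss_{A_y}}}, \qquad (113)$$

and

$$r_{xy(w)} = \frac{sp_w}{\sqrt{ss_{w_x}\, ss_{w_y}}}, \qquad (114)$$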

in which $ss_{A_x}$ is the between-groups sum of squares for the X measures, etc.

When all groups are of the same size, $r_{xy(A)}$ is simply the correlation
between the group means for X and Y, $r_{xy(w)}$ is the "average" of the
correlations between X and Y for the individual groups, and $r_{xy(T)}$ is the
correlation computed for all the measures.
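The three coefficients of (112) to (114) can be sketched as follows. The helper names `components` and `r`, and the scores themselves, are hypothetical; the groups are of equal size so that the between-groups coefficient can be compared with the correlation between group means.

```python
from math import sqrt


def components(groups):
    """Return {'T': ..., 'A': ..., 'w': ...}, each a tuple (ss_x, ss_y, sp)."""
    pairs = [p for g in groups for p in g]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    total = (
        sum((x - mx) ** 2 for x, _ in pairs),
        sum((y - my) ** 2 for _, y in pairs),
        sum((x - mx) * (y - my) for x, y in pairs),
    )
    between = [0.0, 0.0, 0.0]
    for g in groups:
        gx = sum(x for x, _ in g) / len(g)
        gy = sum(y for _, y in g) / len(g)
        between[0] += len(g) * (gx - mx) ** 2
        between[1] += len(g) * (gy - my) ** 2
        between[2] += len(g) * (gx - mx) * (gy - my)
    # Within-groups components obtained as residuals.
    within = tuple(t - b for t, b in zip(total, between))
    return {"T": total, "A": tuple(between), "w": within}


def r(ss_x, ss_y, sp):
    """Correlation from two sums of squares and a sum of products."""
    return sp / sqrt(ss_x * ss_y)


# Hypothetical (X, Y) scores for three groups of three subjects each.
demo = [[(1, 2), (3, 3), (5, 7)], [(4, 6), (6, 5), (8, 10)], [(2, 9), (4, 8), (6, 13)]]
parts = components(demo)
r_T, r_A, r_w = (r(*parts[k]) for k in ("T", "A", "w"))
print(r_T, r_A, r_w)
```

With equal group sizes the factor $n_i$ cancels from numerator and denominator of (113), which is why `r_A` reduces to the ordinary correlation between the group means.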

We know from simple correlation theory that the coefficient of regression 
of Y on X is equal to $\sum xy / \sum x^2$. Accordingly, the regression coefficient for
the total sample is 


$$b_T = \frac{sp_T}{ss_{T_x}}. \qquad (115)$$


(In this discussion, since we are not interested in the regression of X on Y, 
it will be understood that any regression coefficient referred to is a regression 
of Y on X.) 

Similarly, the regression coefficient for within-groups is 


$$b_w = \frac{sp_w}{ss_{w_x}}. \qquad (116)$$


If we now let $Y' = Y - b_w x$ represent what we have called an adjusted
criterion score, we may note that the mean of these adjusted scores for the
total sample will be

$$M_{Y'} = \frac{\sum (Y - b_w x)}{N} = M_Y$$

(since $\sum x = 0$), and that the deviation of a single adjusted criterion
score from the general mean of adjusted criterion scores is

$$y' = (Y - b_w x) - M_Y = y - b_w x.$$




The deviation of a group mean of adjusted scores, or the adjusted mean of
the group, from the general mean, is given (for group i) by

$$\bar{y}'_i = \frac{\sum (y_i - b_w x_i)}{n_i} = \bar{y}_i - b_w \bar{x}_i,$$

in which $x_i$ is the deviation of a measure in group i from the general X-mean
and similarly for $y_i$.

The deviation of a single adjusted score in group i from the adjusted mean
of the group is given by

$$y'_i - \bar{y}'_i = (y_i - b_w x_i) - (\bar{y}_i - b_w \bar{x}_i) = (y_i - \bar{y}_i) - b_w (x_i - \bar{x}_i).$$

For any one group, the sum of the squared deviations of the adjusted scores
from the adjusted mean for the group would be

$$\sum \left[ (y_i - \bar{y}_i) - b_w (x_i - \bar{x}_i) \right]^2.$$


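By summing these expressions over all a groups, we get the sum of the
squared deviations of all of the adjusted measures from their respective group
means, which we will call the "adjusted" within-groups sum of squares,
$ss'_{w_y}$, as follows:

$$\begin{aligned}
ss'_{w_y} &= \sum\sum \left[ (y_i - \bar{y}_i) - b_w (x_i - \bar{x}_i) \right]^2 \\
          &= \sum\sum (y_i - \bar{y}_i)^2 - \frac{\left[ \sum\sum (x_i - \bar{x}_i)(y_i - \bar{y}_i) \right]^2}{\sum\sum (x_i - \bar{x}_i)^2} \\
          &= ss_{w_y} - \frac{(sp_w)^2}{ss_{w_x}}. \qquad (117)
\end{aligned}$$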

The degrees of freedom for this "adjusted" sum of squares for within-groups
is not the same as that for the sum of squares for within-groups for
the unadjusted Y measures, since one degree of freedom has been lost by
imposing the linear restriction that the deviations be computed from the
within-groups regression line. Accordingly, the degrees of freedom for the
adjusted within-groups sum of squares is $N - a - 1$, and the adjusted mean
square for within-groups is

$$ms'_w = \frac{ss'_{w_y}}{N - a - 1}.$$
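The equivalence of the first and last lines of (117) can be checked numerically. The sketch below is only an illustration under assumed data: it computes the adjusted within-groups sum of squares once directly, by squaring deviations from the within-groups regression line, and once by the shortcut form. All function names and scores are hypothetical.

```python
def group_means(g):
    n = len(g)
    return sum(x for x, _ in g) / n, sum(y for _, y in g) / n


def within_sums(groups):
    """Return (ss_wx, ss_wy, sp_w) from deviations about each group's means."""
    ss_x = ss_y = sp = 0.0
    for g in groups:
        mx, my = group_means(g)
        for x, y in g:
            ss_x += (x - mx) ** 2
            ss_y += (y - my) ** 2
            sp += (x - mx) * (y - my)
    return ss_x, ss_y, sp


def adjusted_within_direct(groups):
    """Sum of squared deviations from the within-groups regression line."""
    ss_x, _, sp = within_sums(groups)
    b_w = sp / ss_x                                   # formula (116)
    total = 0.0
    for g in groups:
        mx, my = group_means(g)
        total += sum(((y - my) - b_w * (x - mx)) ** 2 for x, y in g)
    return total


def adjusted_within_shortcut(groups):
    ss_x, ss_y, sp = within_sums(groups)
    return ss_y - sp ** 2 / ss_x                      # last line of (117)


# Hypothetical (X, Y) scores for two groups of three subjects.
demo = [[(1, 2), (3, 3), (5, 7)], [(4, 6), (6, 5), (8, 10)]]
direct = adjusted_within_direct(demo)
shortcut = adjusted_within_shortcut(demo)
n_total = sum(len(g) for g in demo)
ms_w_adj = shortcut / (n_total - len(demo) - 1)       # df = N - a - 1
print(direct, shortcut, ms_w_adj)
```

The two routes agree because $b_w$ is itself defined as $sp_w / ss_{w_x}$; the shortcut merely avoids computing the individual adjusted deviations.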


The next step is to compute an adjusted sum of squares for total. This 
is the sum of the squared deviations of all of the measures from the regression 
line fitted to the total sample and is computed as follows: 


$$ss'_{T_y} = ss_{T_y} - \frac{(sp_T)^2}{ss_{T_x}}. \qquad (118)$$
The derivation of (118) is similar to that for (117). 




An adjusted sum of squares for between-groups is next found by subtracting 
the adjusted sum of squares for within-groups from that for total as follows: 


$$ss'_{A_y} = ss'_{T_y} - ss'_{w_y}. \qquad (119)$$


This is contrary to the usual practice of securing the sum of squares for
within-groups as a residual. A different adjusted sum of squares for
between-groups could be directly computed as $\sum n_i (\bar{y}_i - b_w \bar{x}_i)^2$,
using the within-groups regression coefficient. However, an adjusted sum of
squares for between-groups thus computed would be inflated by sampling error in
the estimate ($b_w$) of the regression coefficient employed, and would make the
between-groups effect appear more significant than it really is. What we have
called the adjusted sum of squares for between-groups is sometimes called the
"reduced" sum of squares for between-groups, since it is not thus inflated.
Furthermore, if the same regression coefficient were used to compute the 
adjusted sums of squares for between-groups and within-groups, the cor- 
responding mean squares would not be independent, and their ratio would 
not be distributed as F. 

We may now compute an adjusted mean square for between-groups and 
form an F-ratio between the adjusted mean squares for between-groups and 
within-groups, 


$$F = ms'_A / ms'_w, \qquad df = (a - 1)/(N - a - 1). \qquad (120)$$


Proof that this ratio is distributed as F will not be given here. The conditions
under which it is distributed as F are those numbered 3 to 8 in the
list below. The additional conditions (1 and 2) are necessary if, in a
controlled experiment, one is safely to conclude from a significant F that the
experimental treatments have different effects.


1) The subjects in each treatment group were originally drawn either 
(a) at random from the same parent population, or (b) selected from 
the same parent population on the basis of their X-measures only — the 
selection being random with reference to all other factors for any given 
value of X. 


2) The X-measures are unaffected by the treatments.


3) The criterion measures for each treatment group are a random sample 
from those for a corresponding treatment population. 


4) The regression of Y on X is the same for all treatment populations. 
5) This regression is linear. 


6) The distribution of adjusted scores for each treatment population is 
normal. 


7) These distributions have the same variance. 


8) The mean of the adjusted scores is the same for all treatment popula- 
tions. 




Condition 3 is contained in Condition 1, and is that part of Condition 1
which is necessary to an F-distribution of $ms'_A / ms'_w$.

Since, under Conditions 3 to 8, the population means of the adjusted scores
would be identical only if the line of regression of Y on X is the same for all
populations, Condition 8 could also be stated: "The regression lines of Y on X
for the various populations coincide." When all of Conditions 3 to 8 are
satisfied, the relation among the X-Y scatterplots for the populations may
be roughly pictured as in Figure 6.


[Figure 6. Relation among XY scatterplots of populations under Conditions 3 to 8.]
[Figure 7. Relation of scatterplots when only Condition 8 is not satisfied.]
[Figure 8. Relation among scatterplots when only Conditions 4 and 8 are not satisfied.]


The straight line in Figure 6 represents the common line of regression of 
Y on X for all treatment populations. A significant F in an experiment, 
granting Conditions 3 to 7, would suggest that the relationship among the 
population scatterplots is as pictured in Figure 7. If neither Condition 4 
nor Condition 8 is satisfied, but Conditions 3, 5, 6, and 7 are met, the rela- 
tionship might be as pictured in Figure 8. 

If we regard Conditions 3 to 7 as the assumptions underlying the test, the 
F-ratio of (120) provides a test of the hypothesis that the population means 
of the adjusted scores are identical, or that the lines of regression of Y on X
coincide. Whether or not the equality (or inequality) of the adjusted means
implies equality (or inequality) of treatment effects, however, depends upon
how the treatment groups were selected. They might have been so selected
that, due to differences existing before administration of the treatments, the 
adjusted criterion means would differ at the close of the experiment even 
though all groups had received the same experimental treatment. Again, 
there might have been initial differences among the groups that just counter- 
balanced the differences among the experimental treatments, so that at the 
close of the experiment, but not before, the adjusted criterion means are equal. 
Before we may infer from a significant F that the treatments differ, then, we 
must be able to say that had all groups received the same treatment (or 
equally effective treatments) the adjusted criterion means would have been 




the same (except for chance), or the lines of regression of Y on X would 
have coincided. 

We can say this with complete confidence, of course, if all treatment 
groups have been originally drawn at random from the same population, 
granting that all extraneous factors have been experimentally controlled or 
equalized during the course of the experiment. We can also say it with con- 
fidence if we know that all treatment groups were originally selected from the 
same population on the basis of their X-measures only, but essentially at 
random with reference to any other factors. For instance, one treatment 
group might have been selected from the upper end of the X-distribution for 
the parent population, but so that for any selected value of X all individuals 
with that value of X in the population have an equal chance of being drawn. 
The other treatment group might be similarly drawn from the lower end of the
X-distribution, etc. We could then be sure that if all treatment groups are
given the same treatments, the line of regression of Y on X would be the 
same for all treatment groups. If, however, the treatment groups are selected 
on any other basis, so that they exhibit systematic differences other than those 
resulting from the X-selection, then these differences, rather than differences 
in the experimental treatments, could account for the final equality (or in- 
equality) of the adjusted criterion means. It is possible that with groups 
otherwise selected the lines of regression of Y on X will coincide if all groups 
are treated alike, but the methods of selection suggested in Condition 1 
seem to be the only operational procedures for insuring that the lines would 
coincide for the same treatment. The practical significance of Condition 1 
will be more fully considered later. 

If Condition 1 is satisfied, then, of course, Condition 3 is satisfied also. 
Obviously, however, Condition 3 could be satisfied even though Condition 1 
were not. 

It should be emphasized that these conditions apply to a controlled experi- 
ment. Presumably there are no systematic differences among the treatment 
groups in the effects of any extraneous factor. 

It should be noted that the method of analysis of covariance is worth while 
(assuming a correlation between X and Y) even though the X means are 
identical for all treatment groups, in which case no adjustments would need 
to be made in the Y means. Nevertheless, assuming some correlation between 
X and Y, the within-groups variance of the adjusted measures would be less 
than that of the unadjusted measures, so that the precision of the experiment 
would be increased. 


An Illustrative Example 


Suppose that a learning experiment involving three treatments has been 
performed with 12 subjects, four for each treatment, the subjects having been 
assigned at random to the treatments. Let Y represent the criterion score, 
and let X represent a measure of aptitude for learning, to be used as the 




control variable. Let A₁, A₂, and A₃ represent the treatments. The scores
for the 12 subjects are given below.


Control and Criterion Scores in a Hypothetical Experiment

             A₁              A₂              A₃
           X      Y        X      Y        X      Y
          33     18       34     31       34     15
          42     34       55     45        4      8
          40     22        9      1       12     18
          31     24       50     33       16     15
Means   36.5   24.5     37.0   27.5     16.5   14.0

General means: M_X = 30.0, M_Y = 22.0


The steps in the computation are as follows: 


1) Compute the sum of squares for treatments (A), within-groups (w) and 
total (T), for X and Y separately, and then compute the corresponding 
sum of products. 


2) Compute the adjusted sums of squares for within-groups and total 
by means of (117) and (118), respectively. Then secure the adjusted 
sum of squares for treatments (A) as a residual, by means of (119). 


3) Compute the adjusted mean squares for treatments and within-groups 
and form the F-ratio between them to test the significance of the differ- 
ences among the treatment means. 


The results of these computations are summarized in the table below. 
The student is advised to check all computations as an exercise in order to 
familiarize himself with the procedures. 


Source    df     ss_x       sp      ss_y    ss'_y    df    ms'_y
A          2   1094.0    651.0     402.0     19.6     2     9.8
w          9   1854.0   1261.0    1244.0    386.3     8    48.29
Total     11   2948.0   1912.0    1646.0    405.9    10

F = 9.8/48.29 = .20


Since the F is less than 1, it is obvious that the treatment differences are 
not significant. 
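The three computational steps can be retraced with a short script. This is a sketch rather than a prescribed procedure: the variable and function names are illustrative assumptions, and the between-groups components are obtained here by subtracting within from total, which is arithmetically equivalent to computing them from group totals.

```python
# Scores from the table of control (X) and criterion (Y) scores above.
data = {
    "A1": [(33, 18), (42, 34), (40, 22), (31, 24)],
    "A2": [(34, 31), (55, 45), (9, 1), (50, 33)],
    "A3": [(34, 15), (4, 8), (12, 18), (16, 15)],
}


def mean(values):
    return sum(values) / len(values)


def cross(us, vs, mu, mv):
    """Sum of products of deviations of us about mu and of vs about mv."""
    return sum((u - mu) * (v - mv) for u, v in zip(us, vs))


xs = [x for g in data.values() for x, _ in g]
ys = [y for g in data.values() for _, y in g]
mx, my = mean(xs), mean(ys)

# Step 1: sums of squares and products for total, within-groups, and A.
total = {"ss_x": cross(xs, xs, mx, mx), "sp": cross(xs, ys, mx, my),
         "ss_y": cross(ys, ys, my, my)}
within = {"ss_x": 0.0, "sp": 0.0, "ss_y": 0.0}
for g in data.values():
    gx = [x for x, _ in g]
    gy = [y for _, y in g]
    ax, ay = mean(gx), mean(gy)
    within["ss_x"] += cross(gx, gx, ax, ax)
    within["sp"] += cross(gx, gy, ax, ay)
    within["ss_y"] += cross(gy, gy, ay, ay)
between = {k: total[k] - within[k] for k in total}


# Step 2: adjusted sums of squares.
def adjust(c):
    return c["ss_y"] - c["sp"] ** 2 / c["ss_x"]


adj_within = adjust(within)              # formula (117)
adj_total = adjust(total)                # formula (118)
adj_between = adj_total - adj_within     # formula (119)

# Step 3: adjusted mean squares and the F-ratio of (120).
n_total, a = len(xs), len(data)
ms_between = adj_between / (a - 1)
ms_within = adj_within / (n_total - a - 1)
f_ratio = ms_between / ms_within
print(round(adj_between, 1), round(adj_within, 1), round(adj_total, 1), round(f_ratio, 2))
```

Running the script is one way to carry out the check of the summary table that the text recommends to the student.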

It will be noted that the error variance for the unadjusted criterion scores 
is 1244/9 = 138.2. Accordingly, through an analysis of covariance, the error 
variance has been reduced from 138.2 to 48.29, or the precision of the experiment
has been almost tripled.¹


¹ The ratio 138.2 to 48.29 overestimates slightly the gain in precision since it ignores
the sampling error in b. A method of making allowance for this sampling error is given 
in Cochran and Cox, Experimental Designs, pages 81-82. 




The extent to which the use of the methods of analysis of covariance increases
the precision of an experiment of this type depends upon the within-groups
correlation between the criterion and control variables. The ratio
between the adjusted error variance and the unadjusted error variance is very
nearly equal to $(1 - r^2)$. In the illustration here used, the within-groups
correlation is .83, and the ratio between the adjusted and unadjusted error
variances is .35, which is very nearly equal to $(1 - .83^2)$. The correlation
of .83 found in this example is, of course, higher than would be found in 
most actual experiments, but very often the correlation is high enough to 
increase the precision of the experiment very substantially. 
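The phrase "very nearly equal" can be made exact. The following short derivation is offered as a worked check; it assumes only formula (117), the definition of the within-groups correlation in (114), written here as $r_w$, and the degrees of freedom given earlier.

```latex
\frac{ms'_w}{ms_w}
  = \frac{ss_{w_y}\,(1 - r_w^2)\,/\,(N - a - 1)}{ss_{w_y}\,/\,(N - a)}
  = (1 - r_w^2)\,\frac{N - a}{N - a - 1},
\qquad\text{since}\qquad
ss'_{w_y} = ss_{w_y} - \frac{(sp_w)^2}{ss_{w_x}} = ss_{w_y}\,(1 - r_w^2).
```

With the values of the example ($r_w = .8303$, $N - a = 9$), the right-hand side is $(1 - .8303^2)(9/8) = .349$, which agrees with 48.29/138.2. The factor $(N - a)/(N - a - 1)$ approaches 1 as the number of subjects increases, so the approximation $(1 - r^2)$ improves with larger samples.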

If the overall F proves significant, one will wish to compute the adjusted
treatment means in order to be able to test differences for individual pairs
of treatments. To do this, in this example, we must first find the value of
the within-groups regression coefficient, which according to (116) is .6801.
Accordingly, the adjusted criterion mean for A₁ is
$24.5 - .6801(36.5 - 30.0) = 20.08$, for A₂ is 22.74, and for A₃ is 23.18.
The corresponding unadjusted criterion means are 24.5, 27.5, and 14.0,
respectively. Thus we see that the differences among the treatment means for
adjusted scores are very much less than for the unadjusted scores. We know then
that the differences among the treatment means of unadjusted criterion scores
are very largely accounted for by chance differences in the learning ability of
the subjects. We note particularly how much the unadjusted mean for A₃ was
lowered by the low aptitude of the subjects in the A₃ group. After adjustment,
the A₃ mean is higher than the others, where before it had been much lower. If
the differences among the treatment means had proven significant, we might
have wished to test the significance of the difference in a particular pair of
adjusted treatment means. The error variance of the difference between
two adjusted criterion means $(\bar{Y}'_i - \bar{Y}'_j)$ is given by


$$\sigma^2_{\bar{Y}'_i - \bar{Y}'_j} = ms'_w \left[ \frac{1}{n_i} + \frac{1}{n_j} + \frac{(\bar{X}_i - \bar{X}_j)^2}{ss_{w_x}} \right],$$


in which the adjusted sums of squares may be computed either from the
entire experiment or only from the data for $A_i$ and $A_j$ alone. (Proof of this
formula will not be given here.)
In the example, the error variance of the difference in the adjusted means
for A₁ and A₂ is

$$48.29 \left[ \frac{1}{4} + \frac{1}{4} + \frac{(36.5 - 37.0)^2}{1854} \right] = 24.15,$$

and hence, this difference is tested by

$$t = \frac{20.08 - 22.74}{\sqrt{24.15}} = \frac{-2.66}{4.91} = -.54,$$

which, in this case, of course is not significant.
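The adjusted means and the test of a single pair can be sketched as below, starting from the within-groups row of the summary table and the group means of the example. The names are illustrative assumptions, and the error-variance expression used is the standard one for the difference between two adjusted means: the adjusted within-groups mean square times $1/n_i + 1/n_j + (\bar{X}_i - \bar{X}_j)^2 / ss_{w_x}$.

```python
from math import sqrt

# Within-groups sums from the summary table of the example.
ss_wx, sp_w, ss_wy = 1854.0, 1261.0, 1244.0
n_total, a, n_per_group = 12, 3, 4
grand_mean_x = 30.0

# (mean of X, mean of Y) for each treatment group.
group_means = {"A1": (36.5, 24.5), "A2": (37.0, 27.5), "A3": (16.5, 14.0)}

b_w = sp_w / ss_wx                                           # formula (116)
ms_w_adj = (ss_wy - sp_w ** 2 / ss_wx) / (n_total - a - 1)   # (117), df = N - a - 1

# Adjusted criterion mean: Y-mean minus b_w times the X-mean's deviation.
adjusted = {
    name: my - b_w * (mx - grand_mean_x) for name, (mx, my) in group_means.items()
}


def variance_of_difference(i, j):
    """Error variance of the difference between two adjusted means."""
    xi, xj = group_means[i][0], group_means[j][0]
    return ms_w_adj * (1 / n_per_group + 1 / n_per_group + (xi - xj) ** 2 / ss_wx)


var_12 = variance_of_difference("A1", "A2")
t_12 = (adjusted["A1"] - adjusted["A2"]) / sqrt(var_12)
print({k: round(v, 2) for k, v in adjusted.items()}, round(var_12, 2), round(t_12, 2))
```

Because the X-means of A₁ and A₂ are nearly equal, the third term in the brackets contributes almost nothing here; for a pair such as A₁ and A₃ it would be appreciable.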




Importance of the Assumptions Underlying the Test of Significance of the Treatments Effect


Judging by past applications of the method of analysis of covariance in 
educational and psychological research, the assumptions underlying the test 
of the hypothesis of equal treatment effects are, in general, in greater need of 
critical attention than is true with most, if not all, of the designs previously 
considered. Generally the method has been employed with little regard to 
the conditions under which the test is valid, and instances are numerous in 
which one or more of the conditions have clearly not been satisfied. 

The first condition, concerning the manner of selection of the treatment 
groups, has perhaps most often been violated with serious consequences. In 
a few applications of the method, the treatment groups have been originally 
drawn at random from the same parent population. In these applications the 
method has been used to make adjustments only for chance differences in the 
control variable, and the chief advantage gained through the use of the 
method has been that of increased precision in the treatment comparisons. 
However, in most applications of analysis of covariance in educational and
psychological research, particularly in the former field, the method has
apparently been used in an effort to correct or to make adjustments for
systematic differences existing among the treatment groups before administration
of the treatments, and only rarely have the treatment groups been
selected with reference only to the control variable used in the analysis.
Many experimenters seem to have assumed that in a single-classification
experiment the method of analysis of covariance with a control variable X
is always the equivalent of a treatments × levels experiment with the same
control variable, regardless of the manner in which the treatment groups may
have been selected. That is, they seem to have assumed that the method
eliminates the effects of any systematic differences that may have existed
originally among the treatment groups, even though some of these differences
may be quite independent of the X variable employed. It should be noted
that the treatments × levels design does eliminate not only chance differences
in the control variable, but also initial systematic differences in any
other factor, due to the random assignment of subjects to treatments within
levels. This is true of the method of analysis of covariance, however, only if
Condition 1 is satisfied.

Consider, for example, the use of the method in an experimental compari- 
son of three ways of teaching fourth grade arithmetic. Suppose that the ex- 
periment is performed in a school in which there are available three fourth- 
grade classes which were organized in the usual way at the beginning of the 
school year, with no knowledge that an experiment was later to be performed.
The experiment is performed during the first six weeks of the second semester, 
each experimental instructional method being used with a different class. A 
test of general intelligence is administered to all students at the beginning 




of the experiment and the score on this test constitutes the control variable 
(X) used in an application of the method of analysis of covariance. A signifi- 
cant difference is found in the F of (120), and the conclusion is drawn that 
the instructional methods differ in effectiveness. Many applications of this 
general type have been reported in the literature of educational research. 

How valid this procedure is depends on the history of the classes up to the 
time of the beginning of the experiment, as well as upon the adequacy of 
control of extraneous factors during the course of the experiment. Suppose, 
on the one hand, that all classes had originally been organized on essentially 
a random basis, so that at the beginning of the school year there were no 
differences among the classes larger than could readily be attributed to random 
sampling. Suppose also that throughout the first semester all classes had 
essentially the same educational experiences — all classes had been taught by 
the same teacher and received the same assignments, etc. — so that there was 
no apparent reason to believe that systematic differences among the classes 
had been created since the beginning of the first semester. In that case, at 
the time of the beginning of the experiment the classes might still be reason- 
ably regarded as randomly selected from a single population, and the use of 
the method of analysis of covariance is valid (granting Conditions 4 to 8). 

Suppose, on the other hand, that throughout the first semester the classes 
had had different arithmetic teachers, who had not only differed in personal 
effectiveness but also had used somewhat different methods of teaching arith- 
metic. Suppose the teacher of the class that was later to use experimental 
Method A used a method much like Method A, so that when the experiment 
began the pupils were able at once to use the experimental method with near 
maximum effectiveness. Suppose, however, that the teacher of the class 
that was later to use Method B had used a method which conflicted with 
Method B, so that considerable time was required early in the experiment be- 
fore the pupils were able to use this method effectively. In this case, no 
“adjustments” based on initial intelligence test scores, or even on initial 
arithmetic achievement test scores, could possibly account for the effects of 
these differences upon the final adjusted means of the treatment groups. 
In any such application it would be dangerous, to say the least, to infer from 
a significant F [of (120)] that the differences among the adjusted treatment 
means are due to the treatments themselves. 

The situation is much worse, of course, if the classes were originally selected 
not at random but so as to differ markedly with reference to some trait or 
characteristic related to the criterion variable in the experiment. Suppose, 
for example, that the classes had been selected according to ability and interest, 
that the abler and more industrious students had been assigned to one class 
and the least able to another, and that appropriate modifications in instruc- 
tion had been used with these classes during the first semester. Suppose then 
that an initial achievement test administered at the beginning of the experi- 
ment provided the X-measures used in the analysis of covariance. In this 
case, not only would Assumption 1 be invalid, but differences in regression 




(Assumption 4) and in variability of adjusted scores (Assumption 7) or even 
differences in the nature of the regression (Assumption 5) might well be ex- 
pected. Nevertheless, many applications of this type also may be found 
reported in the research literature. 

Whenever the X-measures are obtained during the course of or after the 
close of an experiment, careful consideration should be given to Condition 2. 
If the X-measures are taken at the beginning of the experiment or before, they 
could obviously not be affected by the treatments no matter what X may 
represent. If they are taken during the course of or after the conclusion of the 
experiment, they may or may not be affected by the treatments. For example, 
if X is a measure of chronological age, it clearly cannot be affected by the 
treatments no matter when the X observations are made. However, if the 
experiment is a learning experiment and X is the score on an intelligence 
test administered at the close of the experiment, it is readily conceivable that 
the X means could be affected by the treatments. If the latter is the case, 
then in "taking out" the effects of X, we would be taking out part of the 
treatments effect itself. It is sometimes useful to make such an analysis, 
but we must be careful in this case not to regard the differences among the 
adjusted criterion means as measures of "the" treatment effect.

Condition 2 has caused little trouble in past applications of the method 
of analysis of covariance in education and psychology, since in nearly all 
applications the X measures have been obtained before administration of the 
treatments. The consequences of a failure to satisfy Condition 2 will be 
further considered later in the section on “Analysis of Covariance as a Means 
of Introducing an Additional Factor into a Factorial Experiment.” 

Of the remaining assumptions, perhaps the most critical in practice is the 
assumption (Condition 4) that the regression of Y on X is the same for all 
treatment populations. Decisions concerning the validity of the other as- 
sumptions — linearity of regression, normality of distribution, and homo- 
geneity of variance — must generally represent judgments based on a priori 
considerations like those discussed in earlier chapters, since available statistical 
tests of the validity of these assumptions are both low in power and difficult 
to apply. A statistical test of homogeneity of regression, however, is readily 
available and is described in the following section. 


Test of Homogeneity of Regression 


The adjusted sum of squares for any one group is the sum of the squared 
deviations from the regression line for that group based on the common within- 
groups regression coefficient. The sum of these sums of squares for all groups 
is the adjusted sum of squares ($ss'_{w_y}$). This adjusted sum of squares may be
analyzed into two components, one of which is the sum of the squared devia- 
tions of the measures each from the regression line for its own group (based 
on the regression coefficient for that group only), the other of which is due to 
differences among these group regression lines. If it can be shown that the 




latter component is significantly larger than the first, we must conclude that 
there are real differences among the group regressions.

For Group i alone, the sum of the squared deviations from the regression
line for that group only is

$$\sum (y_i - \bar{y}_i)^2 - \frac{\left[ \sum (x_i - \bar{x}_i)(y_i - \bar{y}_i) \right]^2}{\sum (x_i - \bar{x}_i)^2},$$

with $(n_i - 2)$ degrees of freedom. For the illustrative example, this sum of
squares for Group A₁ is equal to 82.99.

Summing these expressions for all a groups, we get

$$ss_{\text{dev. fr. grp. regr.}} = \sum\sum (y_i - \bar{y}_i)^2 - \sum_i \frac{\left[ \sum (x_i - \bar{x}_i)(y_i - \bar{y}_i) \right]^2}{\sum (x_i - \bar{x}_i)^2},$$

with $\sum (n_i - 2) = (N - 2a)$ degrees of freedom. Since the first right-hand
term is already known ($ss_{w_y}$), we need only compute the second. For the
illustrative example, the sum of squares for deviation from group regression
is 204.7. The sum of squares for differences among group regression lines
is obtained by

$$ss_{\text{among grp. regr.}} = ss'_{w_y} - ss_{\text{dev. fr. grp. regr.}},$$

with $(N - a - 1) - (N - 2a) = a - 1$ degrees of freedom. For the illustrative
example, this result is 386.3 - 204.7 = 181.6.

It may be shown that on Conditions 3, 5, 6, and 7 the ratio
$ms_{\text{among grp. regr.}} / ms_{\text{dev. fr. grp. regr.}}$ is distributed as F,
with $(a - 1)$ and $(N - 2a)$ degrees of freedom. Accordingly, this F may be used
to test the hypothesis that there are no differences among group regressions.
For the example, this F is 2.66 whereas the 10% point in the F-distribution for
2 and 6 degrees of freedom is 3.46. Hence, the assumption of homogeneous
regression is clearly tenable.
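The test of homogeneity of regression can be retraced on the scores of the illustrative example. The sketch below uses illustrative names throughout; it fits a separate regression within each group, pools the residual sums of squares, and compares the remainder of the adjusted within-groups sum of squares against that pool.

```python
# (X, Y) scores of the illustrative example.
data = {
    "A1": [(33, 18), (42, 34), (40, 22), (31, 24)],
    "A2": [(34, 31), (55, 45), (9, 1), (50, 33)],
    "A3": [(34, 15), (4, 8), (12, 18), (16, 15)],
}


def deviations(g):
    """Deviations of each (X, Y) pair from the group's own means."""
    mx = sum(x for x, _ in g) / len(g)
    my = sum(y for _, y in g) / len(g)
    return [(x - mx, y - my) for x, y in g]


ss_wx = ss_wy = sp_w = 0.0
ss_dev_group = {}
for name, g in data.items():
    d = deviations(g)
    sxx = sum(dx * dx for dx, _ in d)
    syy = sum(dy * dy for _, dy in d)
    sxy = sum(dx * dy for dx, dy in d)
    # Squared deviations from the group's own regression line, df = n_i - 2.
    ss_dev_group[name] = syy - sxy ** 2 / sxx
    ss_wx, ss_wy, sp_w = ss_wx + sxx, ss_wy + syy, sp_w + sxy

n_total = sum(len(g) for g in data.values())
a = len(data)

ss_adj_within = ss_wy - sp_w ** 2 / ss_wx     # formula (117), df = N - a - 1
ss_dev = sum(ss_dev_group.values())           # df = N - 2a
ss_among = ss_adj_within - ss_dev             # df = a - 1

f_regression = (ss_among / (a - 1)) / (ss_dev / (n_total - 2 * a))
print(round(ss_dev, 1), round(ss_among, 1), round(f_regression, 2))
```

The resulting ratio would then be referred to the F-table with $(a - 1)$ and $(N - 2a)$ degrees of freedom, as described above.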

Granting that Conditions 1 and 2 have been met in a controlled experiment, 
to say that the regression of Y on X is heterogeneous (but linear) is to say
that there is a “treatments effect" but that the relative effectiveness of the 
treatments differs for different values of X. There may be some value of X for 
which the treatments are equally effective. If so, for higher values of X a 
certain treatment may be superior to a certain other treatment, but below this 
value the reverse would be true. There is a way of testing the hypothesis, for 
any given value of X, that the Y means in the populations are identical. 
However, we are rarely interested, in educational and psychological experi- 
ments, in the relative effectiveness of the treatments for any particular single 
value of X; rather we are interested in the relative effectiveness of the treat- 
ments for populations that are variable with respect to X. Accordingly, it 
hardly seems worth while to describe this procedure here.¹


¹ See Alexander Mood, Introduction to the Theory of Statistics (New York: McGraw-Hill
Book Company, 1950), pp. 350-357, and M. G. Kendall, The Advanced Theory of
Statistics, Volume II, Third Edition (London: Charles Griffin and Company, Ltd., 
1951), pp. 237-245, for more advanced discussions of analysis of covariance. 




Generalized Procedure 


We are now ready to generalize the method of analysis of covariance for 
application in any of the experimental designs that have heretofore been con- 
sidered, granting that for each subject there are available measures of two 
related variables, one of which is to be used as a control and the other as a 
criterion variable.

As has already been noted, it is always possible to analyze the total sum 
of products into components corresponding exactly to those into which the 
total sum of squares for either the criterion or the control variable may be 
analyzed. This aspect of the analysis, then, should present no problem. 

We have seen also that in some of the more complex designs it may be 
desirable to test the significance of each of a number of different effects, each 
of which may involve the employment of a different error term. (For an 
example, refer to the discussion of the A × B × R design, pages 230-237.)
For any of these several separate tests of significance, the procedure involved 
in the application of the method of analysis of covariance may be described 
in the same general terms. 

Let U represent the "effect" the significance of which is to be tested. 
The “effect” may be the main effect of a treatment, the simple effect of a 
treatment, a first or higher order interaction, or an interaction of two factors 
for a given level of a third, and so forth. Let E represent the appropriate 
error term for U. The error term may be the within-cells component or it 
may be an interaction of any order, or it may be derived by pooling the 
sums of squares for a number of interactions. For example, in a four-factor
(A × B × C × D) design, U may represent AB and E may represent a pooling
of all the higher order interactions. In these terms, the generalized procedure
for testing the significance of U is as follows:


1) Compute the sums of squares for U and E for both X and Y (the control 
and criterion variables) separately, and then compute the corresponding 
sums of products. 


2) Add the sums of squares for U and E for the X-measures to compute 
the sum of squares for U + E. Similarly, compute the sum of products 
for U + E and the sum of squares for U + E for the criterion variable.


3) Compute the adjusted sum of squares for E and U + E respectively, 
substituting in (117) and (118) the corresponding sums of squares for 
the control or criterion variables, and the corresponding sums of products. 


4) Subtract the adjusted sum of squares for E from that for U + E to 
secure the adjusted sum of squares for U. 


5) Divide the adjusted sums of squares for U and E each by their respective 
degrees of freedom to secure the corresponding adjusted mean squares, 




noting that the degrees of freedom for the adjusted mean square for E 
is one less than for the corresponding mean square for the unadjusted 
criterion scores, due to the use of the regression coefficient. 


6) Form the F-ratio between the adjusted mean squares for U and E to 
test the significance of the U effect. 
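The six steps above can be sketched as a single function. This is one possible arrangement, not a fixed routine: the function name `adjusted_f`, the dictionary keys, and the choice to pass each component as a dictionary are all assumptions made for illustration. The usage lines apply it to the A and w rows of the earlier illustrative example.

```python
def adjusted_f(effect, error):
    """Adjusted F-ratio for an effect U against its error term E.

    effect, error: dicts with keys 'ss_x', 'sp', 'ss_y', 'df', holding the
    unadjusted sums of squares, sum of products, and degrees of freedom
    (step 1 is assumed to have been carried out already).
    Returns (F, df for U, df for adjusted E).
    """
    def adjust(ss_x, sp, ss_y):
        # Common form of (117) and (118).
        return ss_y - sp ** 2 / ss_x

    # Step 2: sums of squares and products for U + E.
    pooled = {k: effect[k] + error[k] for k in ("ss_x", "sp", "ss_y")}

    # Step 3: adjusted sums of squares for E and for U + E.
    adj_error = adjust(error["ss_x"], error["sp"], error["ss_y"])
    adj_pooled = adjust(**pooled)

    # Step 4: adjusted sum of squares for U as a difference.
    adj_effect = adj_pooled - adj_error

    # Step 5: adjusted mean squares; E loses one degree of freedom.
    df_effect = effect["df"]
    df_error = error["df"] - 1
    ms_effect = adj_effect / df_effect
    ms_error = adj_error / df_error

    # Step 6: the F-ratio.
    return ms_effect / ms_error, df_effect, df_error


# Usage with the A and w rows of the earlier summary table.
effect_a = {"ss_x": 1094.0, "sp": 651.0, "ss_y": 402.0, "df": 2}
error_w = {"ss_x": 1854.0, "sp": 1261.0, "ss_y": 1244.0, "df": 9}
f_ratio, df_effect, df_error = adjusted_f(effect_a, error_w)
print(round(f_ratio, 2), df_effect, df_error)
```

In the single-classification case U + E is simply the total, so the function reduces to formulas (117) to (120); in more complex designs only the components passed as `effect` and `error` change.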


It will be observed that where a number of different tests of significance 
must be made in the same design, the computation in an analysis of co- 
variance may become rather tedious, but if the correlation between the con- 
trol and the criterion variables is substantial, the increased precision may 
be enough to justify the additional computational labor involved. 


Analysis of Covariance vs. the Treatments × Levels Design as a Means of Increasing the Precision of an Experiment


The relative importance of the assumptions underlying the test of significance
may be further clarified by directly contrasting the method of analysis of
covariance with the use of the treatments × levels design. These may be
regarded as alternative ways of increasing the precision of the experiment
through the control of a concomitant variable, the one employing a statistical
and the other an experimental control. When both methods are available, the
use of the treatments × levels design is generally to be recommended. In this
situation, the method of analysis of covariance offers certain administrative
advantages or conveniences, but these are generally of relatively minor
importance. For example, the use of the method of analysis of covariance may
simplify the administration of the experiment by avoiding the necessity,
before the experiment may begin, of constituting the subjects into levels with
proportional frequencies at each level. It does this at the cost of some
additional computational labor, but this may be negligible in relation to the
administrative conveniences gained. Furthermore, when the method of analysis
of covariance is used, the control measures may be secured at a more
convenient time, either during the course of the experiment, or even after its
conclusion, depending upon the nature of the control variable.
The treatments × levels design, however, has the very important general
advantage that it requires much less restrictive assumptions than the method
of analysis of covariance. The method of analysis of covariance assumes linear
regression; the test of the treatments effect in the treatments × levels design
is valid no matter what the nature of the regression, so long as the
assumptions of within-cells homogeneity and normality are satisfied. The test
of the treatments effect in the method of analysis of covariance assumes
homogeneous regression for all treatment groups, which is equivalent in a
treatments × levels design to assuming that there is no interaction of
treatments and levels. The test of the main effect of treatments in the
treatments × levels design is valid whether or not an interaction exists.


1 See Cochran and Cox, Experimental Designs, page 81, for approximate shortcut
procedures.

There are differences, too, in the kind of information that may be derived 
from the experimental data. A test of interaction may be made in either case 
— as previously noted, with the method of analysis of covariance the test of 
homogeneity of regression is equivalent to the test of interaction in the 
treatments X levels design — but the computational procedures are con- 
siderably simpler with the treatments X levels design. The use of the treat- 
ments X levels design permits a study of the “simple” effects of the treat- 
ments at any given level. (See the reference of the footnote on page 331 for a 
less convenient way of studying simple effects with analysis of covariance.) 

In general, then, it would appear that the method of analysis of covariance 
should be employed only when the use of the treatments X levels design is 
not a practicable alternative, and only when careful consideration indicates 
that the underlying assumptions are at least approximately satisfied. 

It may be worth noting that it is possible to apply both techniques
simultaneously, using the method of analysis of covariance for statistical
control of one concomitant variable in an experiment in which another
concomitant variable is experimentally controlled through the use of the
treatments × levels design.


Analysis of Covariance as a Means of Introducing an Additional Factor into a Factorial Experiment


In most of the preceding discussions, it was implied that the method of 
analysis of covariance was being employed primarily in order to increase the 
precision of the experiment and to adjust for initial differences in X, and
not because the relationship of the control and criterion measures was of any 
interest in itself. In many instances, however, the X-factor may be intro- 
duced for exactly the same reasons that any other “factor” is introduced in 
a factorial experiment — that is, in order to study its relationship to the 
other factors or the manner in which it may affect the comparisons within 
any of the other classifications. 

For example, in an experiment concerned with methods of instruction of a 
school subject, it may be suspected that certain of the methods may motivate 
the pupils to spend more time in study out of class than others. The experi- 
menter may accordingly wish to know which method would have resulted in 
highest achievement had the pupils spent the same total time in study under 
each method or what part of the effect of each method is a direct and what 
part is an indirect effect brought about through the increase in study time. 
Suppose that a record was kept during the experiment of the amount of 
study time for each pupil, and that from this record the total time for each 
pupil was determined. By the method of analysis of covariance, the mean
scores on the criterion achievement tests could then be adjusted so as to
eliminate the effect of the time differences.


Statistical Control of More than One Concomitant Variable 


If it is desired to control statistically the effect of more than one
concomitant variable, the adjustments must be made by means of the multiple
regression equation between the criterion and the concomitant variables. The
regression coefficients can be computed as before from the error terms secured
through analyses of the variances and covariances of the variables involved.

In the case where allowances are to be made for two initial measures (X
and Z), the multiple regression equation will be

$$\hat{y} = b_1 x + b_2 z.$$

To compute these regression coefficients, an analysis of variance must be
carried through for each of the three variables and for the three possible
covariances. Having found the error term (sum of squares or products) in each
of these analyses, the results may be substituted in the following simultaneous
equations, which may then be solved for $b_1$ and $b_2$.

$$\sum xy = b_1 \sum x^2 + b_2 \sum xz$$
$$\sum zy = b_1 \sum xz + b_2 \sum z^2$$

The formula for computing any adjusted score ($Y'$) will then be

$$Y' = Y - b_1 x - b_2 z.$$

The total sum of squares for the adjusted scores will be

$$\sum y'^2 = \sum (y - b_1 x - b_2 z)^2 = \sum y^2 - 2 b_1 \sum xy - 2 b_2 \sum zy + b_1^2 \sum x^2 + 2 b_1 b_2 \sum xz + b_2^2 \sum z^2.$$


Each of the components of the adjusted total sum of squares may be com- 
puted by the same formula from the corresponding components of the sums 
of squares and products for the three variables involved. The error variance 
for adjusted scores will then be computed as before, after having allowed 
for the two degrees of freedom utilized in computing regression coefficients. 
The variance for treatments would then be computed as a residual in a 
manner similar to that already described. 
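As a rough illustration of the two-covariate computation, the simultaneous equations can be solved directly and the adjusted error sum of squares formed from the six error-term sums. The function name and all numerical values below are hypothetical; the sketch assumes the error-term sums of squares and products have already been obtained from the separate analyses.

```python
def two_covariate_adjustment(sxx, szz, sxz, sxy, szy, syy):
    """Solve the two normal equations for b1 and b2 and adjust sum(y^2).

    Arguments are error-term sums of squares and products:
    sxx = sum(x^2), szz = sum(z^2), sxz = sum(xz),
    sxy = sum(xy), szy = sum(zy), syy = sum(y^2).
    """
    # Cramer's rule on:  sxy = b1*sxx + b2*sxz
    #                    szy = b1*sxz + b2*szz
    det = sxx * szz - sxz ** 2
    b1 = (sxy * szz - szy * sxz) / det
    b2 = (szy * sxx - sxy * sxz) / det
    # Expanded sum of squares of (y - b1*x - b2*z).
    adjusted = (syy - 2 * b1 * sxy - 2 * b2 * szy
                + b1 ** 2 * sxx + 2 * b1 * b2 * sxz + b2 ** 2 * szz)
    return b1, b2, adjusted


# Hypothetical error terms, for illustration only.
b1, b2, adj_error = two_covariate_adjustment(
    sxx=100.0, szz=80.0, sxz=40.0, sxy=60.0, szy=50.0, syy=90.0)

df_error_unadjusted = 30                 # hypothetical
df_error = df_error_unadjusted - 2       # two regression coefficients estimated
print(b1, b2, adj_error, adj_error / df_error)
```

The same function can be applied to the sums for treatments plus error, so that the adjusted treatments sum of squares is obtained as a residual.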

Similar methods could be employed to allow for still other concomitant 
measures, but obviously with a tremendous increase in the amount of labor 
involved. The computational task for two control measures is not at all 
unmanageable, and may sometimes be worth while, considering the ease with 
which additional measures may sometimes be secured. The advantage gained
depends upon the magnitude of the multiple correlation coefficient for the
contemplated number of variables as compared with that for the best combination
of any smaller number. Experience with educational tests has shown that in
situations of this kind the multiple correlation of two measures with the
criterion will seldom be very much higher than the higher of the two zero
order correlations, and that usually only a negligible increase in the multiple
correlation is secured by adding a third independent variable (assuming, of
course, that the two already selected are the best two for the purpose). It is
hardly worth while, therefore, to attempt here a description of the more
complex procedures required for three or more concomitant measures.

For the case of the two concomitant measures already considered, it may
be worth pointing out that the multiple correlation $R_{y \cdot xz}$ between the
initial concomitant measures and the criterion may be computed from the formula

$$R^2_{y \cdot xz} = \frac{b_1 \sum xy + b_2 \sum zy}{\sum y^2}.$$

Whether the labor of allowing for both variables is worth while then
depends upon how much

$$\sqrt{1 - R^2_{y \cdot xz}}$$

is less than either

$$\sqrt{1 - r^2_{xy}} \quad \text{or} \quad \sqrt{1 - r^2_{zy}}.$$
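A small numerical sketch may make the comparison concrete. It assumes that the quantities compared are the residual factors √(1 − R²) and √(1 − r²); the function name and the error-term figures are hypothetical.

```python
import math


def multiple_r_squared(sxx, szz, sxz, sxy, szy, syy):
    """R^2 of the criterion on two concomitant measures, from error-term sums."""
    det = sxx * szz - sxz ** 2
    b1 = (sxy * szz - szy * sxz) / det
    b2 = (szy * sxx - sxy * sxz) / det
    return (b1 * sxy + b2 * szy) / syy


# Hypothetical error-term sums of squares and products.
sxx, szz, sxz = 100.0, 80.0, 40.0
sxy, szy, syy = 60.0, 50.0, 90.0

r2_both = multiple_r_squared(sxx, szz, sxz, sxy, szy, syy)
r2_x = sxy ** 2 / (sxx * syy)   # squared zero-order correlation of X with Y
r2_z = szy ** 2 / (szz * syy)   # squared zero-order correlation of Z with Y

for label, r2 in (("X and Z", r2_both), ("X only", r2_x), ("Z only", r2_z)):
    print(f"{label}: r^2 = {r2:.3f}, sqrt(1 - r^2) = {math.sqrt(1 - r2):.3f}")
```

The smaller the gap between the first line of output and the better of the other two, the less there is to gain from carrying the second control measure.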


STUDY EXERCISES¹


1. In a simple-randomized experiment, measures of a related variable (X), 
as well as of the criterion variable (Y), were obtained for all subjects. These 
measures are given in the table below: 


A1 A2 A3 A4
X Y X Y X Y X Y
97 23 9 2 96 20 89 9 
106 19 85 15 127 36 95 15 
105 23 86 25 107 24 91 9 
76 23 107 30 105 23 l0 22 
128 33 ns 22 106 28 96 20 
107 24 83 15 106 22 107 21 
103 10 120 30 98 14 109 21 
95 23 12 23 106 4 117 20 
104 27 14 24 17 12 79 8 
109 28 109 18 126 20 86 21 
10 27 T 7 
109 18 103 23 
129 33 
94 18 


1 See second paragraph on page viii.




a) Prepare a table (similar to that on page 326) summarizing the results of 
an analysis of variance and covariance of these data and giving the F 
needed to test the treatments effect. Give the degrees of freedom for 
this F. 


b) Compute the error mean square obtained in an analysis of the variance 
of the Y measures only. Express the error mean square of (a) as a per- 
cent of this mean square. What is the meaning of this ratio? 


c) Compute the estimated $r_{xy(w)}$. Compute $[1 - r^2_{xy(w)}]$. Compare with
the result of (b).


d) What is the estimated coefficient of regression of Y on X for the A1
group alone?


e) Test the hypothesis that the within-group regressions are homogeneous. 


f) Compute the t needed to test the hypothesis that the means of the A1 and
A2 treatment populations are identical. What is the number of degrees
of freedom for this t?


2. An experiment was performed by Kruglak¹ to determine the relative
effectiveness of two laboratory procedures in teaching elementary college 
physics: the "conventional" (or control) method, in which the students 
worked in pairs and performed each experiment by following a manual, and 
the experimental method, in which the instructor performed the experiment 
while the students observed and recorded results. 

Subjects for the experiment were selected from 194 college students regis- 
tered for physics “lab” at the University of Minnesota. Subjects included in 
the sample were male, Minnesota high school graduates who could take lab on 
Tuesdays. Four laboratory periods were held on Tuesday, two morning 
(8-10 and 10-12) and two afternoon (1-3 and 3-5) periods. Two sections 
were scheduled during each period. Subjects who could take lab in a par- 
ticular period were assigned at random to one of the two sections for that 
period. Different numbers of students were available at each time period. 
In order to facilitate statistical analysis, cases were rejected at random until 
each subgroup was equal in size to the smallest one. This resulted in 56
students for the analysis, 7 in each section. 

Four instructors were scheduled to teach lab sections on Tuesdays. Each 
instructor had a morning and an afternoon section. The two instructors 
scheduled at each period were assigned at random to either the control or
experimental section; each taught the opposite section (method) at the
other time of day. Thus each instructor taught a control section and an
experimental section.


1 Haym Kruglak, “A Comparison of the Conventional and Demonstration Methods
in the Elementary College Physics Laboratory,” Journal of Experimental Education,
vol. 20 (March, 1952), pp. 293-300.




A 30-item pre-test of laboratory practice was administered to all subjects 
at the beginning of the quarter. The scores on this test were used as control 
measures. The criterion measure for each subject was his score on the same 
test administered at the close of the quarter during which the experiment 
was in progress. A summary of sums of squares and products is presented 
below. 


                         Control                  Criterion
                   df      ss_X       sp_XY         ss_Y
Methods (A)         1      23.14      371.57      5,965.79
Instructors (I)     3     154.07      123.43      1,140.00
Interaction (AI)    3     274.14      121.57      2,332.21
Within             48   1,836.00    1,554.86     24,685.71
Total              55   2,287.35    2,171.43     34,123.71


a) Using the “within” term as “error,” compute the adjusted (reduced) 
sums of squares and mean squares for A, I, and AI, as well as the ad- 
justed sum of squares and mean square for “within,” and present these 
in a summary table. 


b) The sum of $(\sum xy)^2/\sum x^2$ for the eight groups is 8185.0,¹ x and y
representing deviations from the group means. Test (at the 5% level) the
hypothesis that the regression is homogeneous for the eight AI populations.


c) What is your estimate of the (presumably common) correlation between 
initial and final measures within individual sections? Is this correla- 
tion high enough to make the use of the method of analysis of covariance 
worth while? Justify your answer by a comparison of the unadjusted 
and the adjusted mean squares for “within.” 


d) Test, at the 1% level, the hypothesis that there is no AI interaction in
the population. Specify the population.


e) Represent the design as a 2 X 4 diagram, in which columns correspond 
to treatments and rows to instructors, with “a.m.” or “p.m.” in each 
cell to indicate the time of day involved. Suppose that there is a general 
tendency for the morning sections to do better than the afternoon sec- 
tions, either because of a selection of students or because the morning 
is more favorable to high achievement. How does this “time-of-day”
factor affect the A-effect in this design? ... the I-effect? ... the AI-effect?
In other words, with what is the time-of-day factor confounded?
Since the “within” term is being used as “error,” is this confounding
desirable or undesirable? Explain.


1 This figure is a guess; Kruglak did not report the data needed to compute this term. 


f) Considering your reasoning in (e), is it plausible that the AI interaction
is partly extrinsic? Is it possible that a substantial intrinsic
interaction was accidentally cancelled by the extrinsic interaction in
this experiment? Explain. Do you feel safe in concluding from the test
of (d) that there is no intrinsic interaction in the population?


g) Test, at the 1% level, the hypothesis that the treatment means in the
population are identical. Specify the population, on the assumptions
(1) that there is no AI interaction and (2) that there is an AI interaction.


h) Suppose that the four instructors involved in this experiment may be
regarded as a simple random sample from a meaningful population of
instructors. Suppose also that there is an intrinsic AI interaction
but no extrinsic interaction. How then would you test the A-effect in
this experiment? Why? For the data here reported, compute and report
the necessary adjusted mean squares and their degrees of freedom, and
report also the F-ratio on which the test is based. On the basis of this
test, is the observed A-effect significant at the 1% level?


i) Why was the supposition necessary in (h) preceding that there is no
extrinsic interaction? Had the sections been assigned to instructors
wholly at random within each treatment, without regard to time of day,
would this supposition be necessary and would the test of (h) be valid?
Explain.


j) Was there any real need to make the number of subjects the same in
each cell (AI combination)? How could the experiment have been
improved in this respect?


k) What are the obstacles to the use of the treatments × levels design in
this experiment?


l) In his report of this study, Kruglak did not report any means, nor any
sums of squares or products for individual groups. Comment on this
omission from the point of view of good practice in general in research
reporting.


15


Tests Concerned With Trends 


Introduction 


Treatment-classifications in experimental designs are of two major types. 
One is the type in which the various treatments represent different amounts, 
or durations, or intensities, etc., of a single common experimental factor, and 
in which the treatments therefore are clearly ordered. For example, the 
treatments in a certain classification may represent different amounts of 
practice in the same task, or different degrees of intensity of the same visual 
stimulus, or different lengths of time in which forgetting may take place, or 
increasing numbers of trials or attempts to perform a certain task, etc. The 
other type of treatment-classification is that in which the various treatments 
are essentially unordered, and in which their distinguishing characteristics or 
differences are described in categorical and usually in qualitative rather than 
in quantitative terms. For example, the treatments may represent complex 
methods of teaching a school subject, or different kinds of interpolated activ- 
ity in a learning series, or different kinds of situations in which stuttering 
may take place, or different ways of distributing practice in the same task, etc. 

In experiments concerned with the latter of these two types, the interest 
is usually in null hypotheses, that is, in hypotheses that the treatments or 
treatment-combinations have the same effect on the means of the populations 
involved. In experiments concerned with ordered treatments, however, the 
interest may be primarily in how the population means of the criterion variable 
change with changes in the experimental factor, rather than in whether or not 
the criterion means do change at all. The hypothesis to be tested in such 
experiments, stated in general terms, may be any of the following: 


$H_1$: The treatment means are unaffected by changes in the experimental
variable, that is, the experimental data reveal no trend at all.


$H_2$: The changes in the treatment means are directly proportional to the
changes in the experimental variable, that is, the trend is linear.


$H_3$: The population means follow a trend (fall along a curve) established
on an a priori basis without reference to the experimental results.


$H_4$: The population means follow a trend derived from the experimental
data, that is, they fall on a curve that has been “fitted” to the observed
means.


Experiments may also be concerned with differences among the trends in 
the treatment means observed under different experimental conditions. 
Stated in general terms of factors A and B, the hypothesis to be tested is 


$H_5$: The trend in the A means (treatments) is the same for all levels of B
(conditions).


Experiments concerned with a single trend may employ the simple-random- 
ized, the treatments X subjects, or almost any of the more complex designs. 
Experiments concerned with differences among trends are, of course, always 
factorial in character. 

Experiments thus concerned with the successive changes in the criterion 
variable accompanying experimental variations in a given treatment, and 
experimental comparisons of such trends for different populations or under 
different conditions, constitute a large and important class of psychological 
experiments. The purpose of this chapter is to describe, for various basic 
types of experimental designs, how exact statistical tests may be applied 
to hypotheses of the general types just suggested. 


Tests of Trend in the Simple-Randomized Design 


The simplest design that may be employed in a study of trend is the simple-
randomized design. For example, suppose that several groups of subjects 
were originally selected at random from the same population, and that one 
group was subjected to a certain treatment for one hour, another group was 
given the same treatment for two hours, another for three, etc. For each 
group, measures of a certain trait (Y) were obtained immediately following 
administration of the treatment. Figure 9 represents the possible outcome
of an experiment of this type, the open dots representing the means of the
criterion variable for the various groups arranged in order of duration (X) of
treatments. The problem would, of course, be the same if X represented 
amount, or intensity, or number of repetitions, etc., of the treatment. The 
X-increment or interval need not be uniform. 

Test for Presence of Trend: Usually the first question to be answered in an
experiment of this kind is: do the data reveal any trend at all? That is,
are the means consistent with the hypothesis ($H_1$) that the treatment means
are unaffected by differences in X? This of course is simply the null
hypothesis considered in Chapter 3, which is tested by


$$F = ms_A / ms_w, \qquad df = (a - 1)/(N - a), \tag{121}$$

in which $ms_A$ is the mean square for treatments and $ms_w$ is the mean square
for within-treatments, in which a represents the number of treatments (or number
of values of X at which observations are made) and N the total number of
subjects.

[Figure 9. Treatment means for various values of X]
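The test of (121) can be sketched as follows. The function name and the scores are hypothetical and serve only to show how the two mean squares are formed from the raw criterion measures.

```python
def one_way_f(groups):
    """F-ratio of (121) for a simple-randomized design.

    groups[i] holds the criterion measures for the i-th value of X.
    """
    a = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-treatments and within-treatments sums of squares.
    ss_a = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_w = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)

    ms_a = ss_a / (a - 1)
    ms_w = ss_w / (n_total - a)
    return ms_a / ms_w, a - 1, n_total - a


# Hypothetical scores for three durations of treatment.
groups = [[2, 4, 3], [5, 6, 4], [7, 9, 8]]
f_ratio, df1, df2 = one_way_f(groups)
print(f_ratio, df1, df2)
```

The resulting F is referred to the F-table with (a − 1) and (N − a) degrees of freedom.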

Especially careful consideration should be given in such experiments to 
the underlying assumption that the population variance (of Y) is the same 
for all values of X. This is sure to be an unsound assumption if the line of 
means begins at, or at any point closely approaches, the base line (Y = 0) 
and if negative values of Y are impossible. In such cases, however, it may 
sometimes be possible to test the hypothesis along only that part of the 
X-scale for which the Y-variance may be considered fairly constant, disregard- 
ing the rest of the data. Sometimes, also, transformations (see pages 88 to 
90) may be employed that will render the criterion measures more nearly 
homogeneous in variability. 

In some situations, there may be strong a priori reason for believing that
if Y depends at all upon X the relationship is monotonic (see page 99),
although the relationship need not be linear. If this is the case, a more
sensitive test for the presence of trend is usually a test of the significance
of the difference in the Y-means for the first and last groups only. That is,
a t-test of the significance of the difference ($M_{A_1} - M_{A_a}$) is much more
likely to be significant than the F-test suggested above. If the assumption of
homogeneity of variance is questionable, the difference in these two means may
be tested by the Behrens-Fisher test (see pages 96-98) or a modification of it.

It should be clear from what has just been said that if the only purpose 
of the experiment is to test for presence of trend, and if it is fairly certain that 
the relationship between X and Y, if any, is monotonic, there is no point in 
making observations for more than two values of X. If it is possible, however, 
that Y may increase (or decrease) with increases of X up to a certain point, 
and that beyond that point Y may decrease (or increase) with further in- 
creases in X, observations must be made for a number of values of X in order 
to secure a dependable test for presence of trend. 




Test for Linear Trend: If $H_1$ proves untenable, it may be desired to test the
hypothesis ($H_2$) that increases in X are accompanied by proportional changes
in Y, or that there is a linear relationship between X and Y. The hypothetical
line of population means in this case is a straight line, but neither its slope
nor its Y-intercept is specified.

This hypothesis is nothing more than the hypothesis of linear regression
of Y on X, and is tested by the methods of analysis of covariance. To make
this test, we must compute the sum of the squared deviations ($ss_X$) of the
X’s from the general X-mean. (This is the “total” sum of squares for the
X’s, since there is no “within groups” variation in the X’s.) We must also
analyze the total sum of squares for the Y-distribution ($ss_{Ty}$) into its
between-treatments ($ss_{Ay}$) and within-groups ($ss_{wy}$) components. Finally,
we must compute the sum of the xy products (sp) for all subjects.

We have seen (page 322) that the sum of the squared deviations of the
Y-measures in the entire sample from the line of regression of Y on X is given
by

$$\sum (y - bx)^2 = \sum y^2 - \frac{(\sum xy)^2}{\sum x^2} = ss_{Ty} - \frac{(sp)^2}{ss_X},$$

in which b is the regression coefficient.

If the observed means had all fallen exactly on the regression line, then
the sum of the squared deviations of the individual Y-measures from the
regression line would be the same as the sum of squares for within-groups
($ss_{wy}$), that is, $ss_{wy}$ would equal $\sum (y - bx)^2$. We know that
$ss_{wy} = ss_{Ty} - ss_{Ay}$. Accordingly, if $ss_{wy} = \sum (y - bx)^2$, it
follows that $ss_{Ay} = (sp)^2/ss_X$. However, since the A-means do not all
fall on the regression line, $ss_{Ay}$ will be larger than $(sp)^2/ss_X$, and the
difference between these terms will be indicative of the amount of departure
from linearity.

We may think, then, of $ss_{Ay}$ as consisting of two components, one of which,
$(sp)^2/ss_X$, is due to linear regression, and the other of which,
$ss_{Ay} - (sp)^2/ss_X$, is due to departure from linearity. The component due
to linear regression has one degree of freedom; that due to departure from
linearity, therefore, has one less degree of freedom than $ss_{Ay}$, or (a − 2)
degrees of freedom. The mean square for departure from linearity is then

$$ms_{\text{dep from lin}} = \frac{ss_{Ay} - (sp)^2/ss_X}{a - 2}.$$

If the departure from linearity is due only to chance, then the mean square
derived from the sum of squares due to departure from linearity should be
the same, except for chance, as the mean square derived from $ss_{wy}$.

It may be shown that under certain conditions the ratio between these
mean squares is distributed as F, that is, we may employ the test

$$F = ms_{\text{dep from lin}} / ms_w, \qquad df = (a - 2)/(N - a), \tag{122}$$

to test $H_2$.

The conditions under which this mean square ratio is distributed as F are:


1) Each treatment group is a random sample from a corresponding popu- 
lation (for which X is constant at the given value). 


2) The criterion measures are normally distributed for each population. 
3) The criterion measures have the same variance for each population. 


4) The population means are a linear function of X, that is, the regression 
of Y on X for the combined populations is linear. 


The last of these conditions constitutes the hypothesis ($H_2$) to be tested;
the other three are the assumptions underlying the test.
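The test of (122) can be sketched in Python as follows. The function name and the data are hypothetical; the sketch simply forms $ss_X$, sp, $ss_{Ay}$ and $ss_{wy}$ as described above and divides the departure-from-linearity mean square by the within-groups mean square.

```python
def linearity_f(x_values, groups):
    """F-ratio of (122): departure from linearity against within-groups.

    x_values[i] is the value of X for the i-th treatment group;
    groups[i] holds the Y-measures obtained at that value.
    """
    a = len(groups)
    n_total = sum(len(g) for g in groups)
    x_mean = sum(x * len(g) for x, g in zip(x_values, groups)) / n_total
    y_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Every subject in a group has the same X, so ss_X is all "between".
    ss_x = sum(len(g) * (x - x_mean) ** 2 for x, g in zip(x_values, groups))
    sp = sum((x - x_mean) * (y - y_mean)
             for x, g in zip(x_values, groups) for y in g)
    ss_ay = sum(len(g) * (m - y_mean) ** 2 for g, m in zip(groups, means))
    ss_wy = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)

    ss_dep = ss_ay - sp ** 2 / ss_x          # departure from linearity
    ms_dep = ss_dep / (a - 2)
    ms_w = ss_wy / (n_total - a)
    return ms_dep / ms_w, a - 2, n_total - a


# Hypothetical data: four durations, three subjects each.
x_values = [1, 2, 3, 4]
groups = [[2, 4, 3], [5, 6, 4], [7, 9, 8], [8, 9, 7]]
f_ratio, df1, df2 = linearity_f(x_values, groups)
print(round(f_ratio, 3), df1, df2)
```

A non-significant F here means only that the data are consistent with linearity; as the text notes, the test is not very powerful.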

It should be noted that the test of linearity is not very powerful, and can 
be expected to reveal only quite marked curvilinearity unless the sample 
is very large. Even with a large sample, it may fail to identify situations 
in which the regression is linear throughout most of the range of the X and Y 
distributions, but is sharply curvilinear at either end where the frequencies 
are small. 

Goodness of Fit to A Priori Trends: It may sometimes be desired to test a
completely a priori hypothesis, that is, one which has been specified in advance
of the experiment without any reference to the experimental data. Such an
hypothesis would specify the exact values ($\mu'_1, \mu'_2, \ldots, \mu'_a$) of the
population means corresponding to the various values of X, or would completely
define the curve of population means. Figure 10 illustrates such a situation,
the heavy dots representing the observed treatment means ($M_1$, $M_2$, $M_3$, and
$M_4$) in a particular experiment, and the line H representing the hypothetical
curve of population means. (It does not matter how one arrives at the
hypothetical values, so long as they have not been selected on the basis of an
inspection of the experimental data. Presumably the hypothesis is based on
theory only. Actually, there are very few situations in which theory can
completely describe the line of population means. The testing of goodness of fit
to a priori trends is therefore of relatively little practical importance in itself,
but it requires consideration here as a step in the development of a test of
goodness of fit to fitted trends, which is discussed in the following section.)

We will let $\mu_1, \mu_2, \ldots, \mu_a$ represent the actual treatment population
means. The hypothesis ($H_3$) to be tested is then that the successive population
means have the values $\mu'_1, \mu'_2, \ldots,$ and $\mu'_a$, respectively, or that
$\mu_1 = \mu'_1$, $\mu_2 = \mu'_2$, etc.

It is at once apparent that the hypothetical means might differ from the
corresponding actual population means by a constant amount, or that the
line H may be parallel to the line describing the actual population means.
This possibility has been graphically represented in Figure 10. Thus the
hypothesis ($H_3$) might correctly describe the pattern of the actual population
means, but may be incorrect so far as their vertical placement is concerned.
We can thus regard the hypothesis ($H_3$) as consisting of two separate
hypotheses, one ($H_{3a}$) describing the pattern of the population means, the
other ($H_{3b}$) describing their vertical placement. We will frequently be more
interested in the pattern of the hypothetical means than in their absolute
values, and hence will wish to be able to make separate tests of these two
aspects of the original hypothesis.

[Figure 10. Observed and hypothesized means]

We may first note that if $H_{3a}$ is true, then by subtracting appropriate
constants from the measures in the various treatment populations, we can make
the actual population means of the “corrected” measures identical for all
treatment populations. Let

$$\mu' = \frac{\sum \mu'_i}{a}$$

represent the general mean of the hypothetical treatment population means.
If we could then “correct” each measure in the ith treatment population by
subtracting ($\mu'_i - \mu'$) from it, and could similarly correct the measures in
each of the other treatment populations, the population means of the
corrected measures would all have the same value (although this value would
not necessarily equal $\mu'$, since $H_{3b}$ may be false). We will let
$Y'_i = Y_i - (\mu'_i - \mu')$ represent a single corrected measure in Group i, and
will let the mean of these corrected measures be represented by
$M'_i = M_i - (\mu'_i - \mu')$. The general mean of the corrected measures would
then be the same as the general mean of the uncorrected measures ($M' = M$).
Now, to test $H_{3a}$, we have only to test the null hypothesis as applied to the
corrected measures.


That is, we may test $H_{3a}$ by

$$F = \frac{\sum n (M'_i - M')^2/(a - 1)}{\sum\sum (Y'_i - M'_i)^2/(N - a)}, \tag{123}$$

which may also be written

$$F = \frac{\sum n \{(M_i - M) - (\mu'_i - \mu')\}^2/(a - 1)}{\sum\sum (Y_i - M_i)^2/(N - a)}, \qquad df = (a - 1)/(N - a). \tag{124}$$


The mean square in the numerator of (124) may be called the mean square
for “departure from pattern” and may be represented by $ms_{\text{dep fr patt}}$. The
mean square in the denominator is, of course, the familiar mean square for
“within groups,” denoted as $ms_w$.

If the pattern aspect of the original hypothesis is true, the corrected
measures for the various treatment populations have the same mean, so that in
the experiment, on the assumptions of normality and homogeneity of
variance, the various treatment groups may together be regarded as
constituting a simple random sample from a single population. We can then
test the hypothesis ($H_{3b}$) that the general mean of the “corrected”
population means is $\mu'$ by means of

$$t = \frac{M' - \mu'}{\sqrt{\hat{\sigma}^2/N}}, \qquad \hat{\sigma}^2 = \frac{\sum\sum (Y'_i - M'_i)^2}{N - a},$$

which may also be written

$$F = \frac{N (M - \mu')^2}{\sum\sum (Y_i - M_i)^2/(N - a)}. \tag{125}$$
We may call the mean square in the numerator of (125) the mean square 
for “vertical placement,” and denote it by ms," 
Since Hy, is of little interest when 7/5, is false, one would ordinarily test 
Hi, first and then test Hs, only if Hs, is tenable. If either Hs. or Hz is 


untenable, of course, //; is untenable also. 
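
As a rough numerical illustration of (124) and (125), the following Python
sketch computes both F ratios from raw scores. It is not part of the original
text: the function name, the use of NumPy, and the choice of the general
hypothetical mean as the n-weighted mean of the hypothetical treatment means
are assumptions made here for illustration only.

```python
import numpy as np

def a_priori_trend_tests(groups, hyp_means):
    """Sketch of the 'departure from pattern' and 'vertical placement' tests.

    groups    : list of sequences of criterion measures, one per treatment
    hyp_means : hypothesized population means, one per treatment
    (Both argument names are hypothetical, chosen for this example.)
    """
    groups = [np.asarray(g, dtype=float) for g in groups]
    hyp = np.asarray(hyp_means, dtype=float)
    n = np.array([g.size for g in groups])
    a, N = len(groups), int(n.sum())

    M_i = np.array([g.mean() for g in groups])   # treatment means
    M = np.concatenate(groups).mean()            # general mean
    # Assumption: general hypothetical mean = n-weighted mean of hyp_means.
    mu_prime = float(np.sum(n * hyp) / N)

    # Within-groups sum of squares and mean square (the error term).
    ss_w = float(sum(((g - g.mean()) ** 2).sum() for g in groups))
    ms_w = ss_w / (N - a)

    # Numerator sums of squares of (124) and (125).
    ss_patt = float(np.sum(n * ((M_i - M) - (hyp - mu_prime)) ** 2))
    ss_vp = float(N * (M - mu_prime) ** 2)

    return {
        "F_pattern": (ss_patt / (a - 1)) / ms_w,
        "df_pattern": (a - 1, N - a),
        "F_vertical": ss_vp / ms_w,
        "df_vertical": (1, N - a),
        "ss_patt": ss_patt, "ss_vp": ss_vp, "ss_w": ss_w,
    }
```

With observed means that differ from the hypothetical means by the same
constant in every group, the first F is zero and only the second can be
large, which is the situation in which the pattern is tenable but the
vertical placement is not.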
Goodness of Fit to Curves Fitted to the Experimental Means: It may sometimes
be desired to test an hypothesis ($H_4$) represented by a curved line
which has been “fitted” to the experimental means. If the hypothetical


1 It is easy to show that

$$ss_{dep\,fr\,H} = ss_{vp} + ss_{dep\,fr\,patt} + ss_w$$

in which $ss_{dep\,fr\,H}$ is the sum of the squared deviations of the individual measures from
their respective hypothetical means. That is,

$$ss_{dep\,fr\,H} = \sum_{i=1}^{a} \sum (Y_i - \mu'_i)^2$$

It may also be shown that the F-tests of (124) and (125) are independent of one another,
so that either $H_{3a}$ or $H_{3b}$ may be tested without regard to the truth or falsity of the other.




curve has been fitted¹ by the method of least squares, the procedure would
be exactly like that in the test of $H_{3a}$ in the preceding section, except that
the degrees of freedom for the numerator of F would be $a$ minus the number
of constants in the regression formula that were derived from the experimental
means. For example, if $Y = AX + B$ had been fitted to the means,
the degrees of freedom for the numerator of F would be $(a - 2)$; if
$Y = AX^3 + BX^2 + CX + D$ had been fitted, the degrees of freedom would be $(a - 4)$, etc.
(In the preceding expressions, the A representing one of the coefficients in the
regression equation should not be confused with the A representing the
treatment classification.)

In the case of a curve fitted to the experimental data, there would be no
question of vertical displacement, and hence no need for a test like that for $H_{3b}$.
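
The adjustment of the degrees of freedom for a fitted curve may be sketched
as follows. This example is an illustration only and is not taken from the
text: it fits a polynomial of a chosen degree to the treatment means with
NumPy's `polyfit`, weighting each mean by its group size (an assumption for
the case of unequal groups), and then forms the F ratio with $a - k$ degrees
of freedom in the numerator, $k$ being the number of fitted constants.

```python
import numpy as np

def fitted_trend_F(x, groups, degree):
    """F for goodness of fit of a least-squares polynomial fitted to the means.

    x      : value of X for each treatment
    groups : list of sequences of criterion measures, one per treatment
    degree : degree of the fitted polynomial (degree + 1 constants are fitted)
    """
    x = np.asarray(x, dtype=float)
    groups = [np.asarray(g, dtype=float) for g in groups]
    n = np.array([g.size for g in groups])
    a, N = len(groups), int(n.sum())
    k = degree + 1                      # constants derived from the means
    if a <= k:
        raise ValueError("need more treatments than fitted constants")

    M_i = np.array([g.mean() for g in groups])
    # polyfit's weights multiply the unsquared residuals, so sqrt(n)
    # gives a fit that minimizes sum n_i * (M_i - fitted_i)^2.
    coefs = np.polyfit(x, M_i, degree, w=np.sqrt(n))
    fitted = np.polyval(coefs, x)

    ss_dep = float(np.sum(n * (M_i - fitted) ** 2))
    ss_w = float(sum(((g - g.mean()) ** 2).sum() for g in groups))
    ms_w = ss_w / (N - a)

    F = (ss_dep / (a - k)) / ms_w
    return F, (a - k, N - a)
```

For a straight line fitted to four treatment means, for instance, the call
`fitted_trend_F(x, groups, degree=1)` returns an F with $(a - 2) = 2$ degrees
of freedom for the numerator.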

We may now note that if $H_3$ is of the type $Y = AX^2 + BX + C$, we may use
the F of (124) to test the hypothesis that A and B are correct. Then, on the
assumption that A and B are correct, the F of (125) may be used to test the
hypothesis that C is correct also. If one wishes to test the hypothesis that
the trend is of the form $Y = AX^2 + BX + C$, without specifying A, B, or C,
the test of $H_4$ would be used. The procedure would be similar if $H_3$ is of the
type $Y = AB^X + C$, $Y = A \log X + C$, etc.


Tests for Trend in Treatments x Levels Designs 


To secure higher precision per subject in a study of trends than is possible
with a simple-randomized design, the various treatment groups may be
matched on the basis of a control variable, using the treatments X levels
design described in Chapter 5. With this design, the tests of trend are very
much the same as in a simple-randomized design except for the use of a
different error term, as is shown in the following paragraphs.

Tests for Presence of Trend: To test the hypothesis that the criterion means
are unaffected by differences in X, the data are analyzed in the manner
described in Chapter 5, and the hypothesis is tested by

$$F = ms_A / ms_w, \qquad df = (a - 1)/(N - al) \tag{126}$$

in which $ms_w$ is the within-cells mean square, $l$ is the number of levels, and $N$
is the total number of subjects.
The assumptions underlying this test are that each treatment group is a
sample representative (with reference to levels) of the population to which
inferences are to be drawn, and that within each level the criterion measures
are normally distributed with the same variance for each value of X.

The preceding test is concerned with the hypothesis that the “main” trend 


1 For methods of fitting curved regression lines by the method of least squares, see
George W. Snedecor, Statistical Methods, Chapter 14 (Curvilinear Regression). The
methods there described are methods for fitting curved regression lines to individual
observations, but, of course, are applicable to means as well.




is a horizontal straight line (the null hypothesis) — the “main” trend being 
the average of the "simple" trends for the various levels. It is possible that 
the main trend is a horizontal straight line, even though the simple trends are 
not. If desired, one may test the hypothesis that the simple trends differ 
from one another. The procedure for testing this hypothesis will be suggested 
later (page 350). 

Test for Linear Trend: To test the hypothesis that the trend is linear in
a treatments X levels design, a test similar to that used in the
simple-randomized design (page 343) involving analysis of covariance must be employed. In
addition to the terms needed to test the null hypothesis, the sum of products
($sp$) and the sum of squares for the X-variable ($ss_X$) must be computed. The
linear hypothesis may then be tested by

$$F = \frac{\left[ss_A - (sp)^2 / ss_X\right] / (a - 2)}{ms_w}, \qquad df = (a - 2)/(N - al) \tag{127}$$


To understand the preceding test, suppose that arithmetic corrections have 
been applied to the original Y-measures so as to eliminate the sums of squares 
for levels and treatments X levels. So far as the corrected measures are con- 
cerned, the design would then reduce to a simple-randomized design, in which 
the linear hypothesis could be tested by means of (122). The test given by
(127) is essentially that of (122) applied to the corrected data, $ss_L$ and $ss_{AL}$
having been eliminated.

To test the linear hypothesis for any single level considered alone, the test of 
(122) would be applied to the data for that level only, since so far as any one 
level is concerned, the design is a simple-randomized design. 
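
The tests of (126) and (127) may be sketched numerically as follows. The
example is an illustration only, not a procedure given in the text: it
assumes equal numbers of subjects in all cells, an array of scores indexed
by treatment, level, and subject, and the definitions of $sp$ and $ss_X$ as
the sum of products and sum of squares computed over all subjects, with each
subject's X taken as the X-value of his treatment. The function and variable
names are hypothetical.

```python
import numpy as np

def trend_tests_treatments_by_levels(x, y):
    """Presence-of-trend and linear-trend F ratios, treatments X levels design.

    x : X-value of each of the a treatments
    y : array of shape (a, l, n), scores by treatment, level, and subject
        (equal cell sizes are an assumption of this sketch)
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a, l, n = y.shape
    N = y.size

    # Within-cells mean square: the error term of (126) and (127).
    cell_means = y.mean(axis=2, keepdims=True)
    ms_w = float(((y - cell_means) ** 2).sum()) / (N - a * l)

    # Sum of squares for treatments.
    M_A = y.mean(axis=(1, 2))
    M = y.mean()
    ss_A = l * n * float(((M_A - M) ** 2).sum())

    # Sum of products and sum of squares for X, over all N subjects.
    xd = x - x.mean()
    sp = l * n * float((xd * (M_A - M)).sum())
    ss_X = l * n * float((xd ** 2).sum())

    return {
        "F_presence": (ss_A / (a - 1)) / ms_w,
        "df_presence": (a - 1, N - a * l),
        "F_linear": ((ss_A - sp ** 2 / ss_X) / (a - 2)) / ms_w,
        "df_linear": (a - 2, N - a * l),
    }
```

When the treatment means fall exactly on a straight line, the numerator of
the linear test vanishes, while the test for presence of trend may still be
highly significant.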

Tests for A Priori and Fitted Trends: With the treatments X levels design,
the tests for a priori and for fitted trends are the same as those employed with 
the simple-randomized design, except that the error mean square is that for 
within-cells rather than for within-treatments. 


Tests for Trend in Treatments x Subjects Designs 


Studies of trends will frequently be concerned with the effect of varying 
amounts, or of varying durations, etc., of a treatment upon the same subjects. 
The treatments X subjects design, therefore, is a particularly important design 
for purposes of trend analysis. 

The tests of $H_1$, $H_3$, and $H_4$ are, except for the error term employed, exactly
the same as in the simple-randomized and treatments X levels designs. The
error mean square in the treatments X subjects design is, of course, the
interaction mean square (treatments X subjects).

With the treatments X subjects design, the linear hypothesis is tested by

$$F = \frac{\left[ss_A - (sp)^2 / ss_X\right] / (a - 2)}{ms_{AS}} \tag{128}$$

for which the degrees of freedom are $(a - 2)$ and $(a - 1)(n - 1)$, in which $n$ is
the total number of subjects per group.
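
A minimal sketch of the test of (128) follows. It is offered as an
illustration under stated assumptions, not as the author's computing
routine: every subject is assumed to have one score under each treatment,
the scores are arranged in an array with treatments as rows and subjects as
columns, and $sp$ and $ss_X$ are computed over all observations. All names
are hypothetical.

```python
import numpy as np

def linear_trend_treatments_by_subjects(x, y):
    """F ratio for the linear hypothesis in a treatments X subjects design.

    x : X-value of each of the a treatments
    y : array of shape (a, n); row i holds the n subjects' scores
        under treatment i (same subjects in every row)
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a, n = y.shape
    M = y.mean()

    ss_total = float(((y - M) ** 2).sum())
    M_A = y.mean(axis=1)                       # treatment means
    M_S = y.mean(axis=0)                       # subject means
    ss_A = n * float(((M_A - M) ** 2).sum())
    ss_S = a * float(((M_S - M) ** 2).sum())

    # Treatments X subjects interaction: the error term.
    ss_AS = ss_total - ss_A - ss_S
    ms_AS = ss_AS / ((a - 1) * (n - 1))

    xd = x - x.mean()
    sp = n * float((xd * (M_A - M)).sum())
    ss_X = n * float((xd ** 2).sum())

    F = ((ss_A - sp ** 2 / ss_X) / (a - 2)) / ms_AS
    return F, (a - 2, (a - 1) * (n - 1))
```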


Tests of Trend in Type II (Confounded) Designs


In studies of trend concerned with the effects of increasing amounts,
intensities, etc., of the experimental factor upon the same subjects, it may be
necessary to use a different form of the criterion test for each value of X. If
equivalent forms of the test are not available, it is possible to counterbalance the
effects of varying difficulties of the forms by using a Type II mixed design (see
page 273). Suppose, for example, that the study is concerned with the effect
of increasing amounts of fatigue upon the time (or number of repetitions) 
required to learn a list of 20 words. Suppose that observations are to be made 
for five degrees of fatigue and that therefore five different lists are employed. 
The subjects may then be divided at random into five equal groups and the 
lists may be administered according to the following diagram: 


          A1        A2        A3        A4        A5
       (X = 1)   (X = 2)   (X = 3)   (X = 4)   (X = 5)
Gr 1     L1        L2        L3        L4        L5
Gr 2     L5        L1        L2        L3        L4
Gr 3     L4        L5        L1        L2        L3
Gr 4     L3        L4        L5        L1        L2
Gr 5     L2        L3        L4        L5        L1
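
An arrangement of this kind is a Latin square: each list must appear once
for each group and once for each degree of fatigue. One simple way of
generating such an arrangement is the cyclic construction sketched below.
The sketch is an illustration only; the cyclic pattern and the function name
are assumptions, and any other Latin square would serve the same
counterbalancing purpose.

```python
def cyclic_latin_square(k):
    """Return a k x k Latin square as a list of rows.

    Row r, column c holds the number of the list given to group r + 1
    at the (c + 1)th value of X. Each row is the previous row shifted
    one place to the right (a cyclic construction, assumed here).
    """
    return [[(col - row) % k + 1 for col in range(k)] for row in range(k)]


def is_latin_square(square):
    """Check that every row and every column contains each list exactly once."""
    k = len(square)
    wanted = set(range(1, k + 1))
    rows_ok = all(set(row) == wanted for row in square)
    cols_ok = all({square[r][c] for r in range(k)} == wanted for c in range(k))
    return rows_ok and cols_ok


if __name__ == "__main__":
    for group, row in enumerate(cyclic_latin_square(5), start=1):
        print(f"Gr {group}: " + "  ".join(f"L{v}" for v in row))
```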


The tests of $H_1$, $H_3$, and $H_4$ would, except for the error term employed, be
the same as in the simple-randomized design. With the Type II design, of
course, the error mean square for testing the main effect of A would be
$ms_{error(w)}$, computed in the manner indicated in Table 11, page 278. This error
term would also be employed in testing $H_3$ and $H_4$.

To test the linear hypothesis, one would use

$$F = \frac{\left[ss_A - (sp)^2 / ss_X\right] / (a - 2)}{ms_{error(w)}} \tag{129}$$

in which the degrees of freedom are $(a - 2)$ and $a(a - 1)(n - 1)$, and $n$ is the
number of subjects in each group.




Comparisons of Trends 


Trend Comparisons in Simple Factorial and Treatments X Levels Designs:
Some experiments are designed, not to test for a single trend, but to determine
whether or not two or more trends differ from one another. In a simple
factorial experiment, for example, one may wish to test the hypothesis ($H_5$) that the
trend in the A-means is the same for all levels of B, or, in a treatments X levels
(A X L) design, one may wish to determine whether or not the A-trend is the
same at all levels of the control variable. Suppose, for instance, that the
experiment described on page 341 is replicated at each of three levels of a
concomitant experimental factor (B), or at each of three levels of a control
variable (the procedure would be the same in either case). The results might
be represented as in Figure 11. We may regard these three curves as
representing different populations, all “generated” from the same parent
population, that from which the three samples were originally drawn. One
population is like the parent population except that all members have received
Treatment $B_1$ in combination with the A treatments. In another, all members
have received Treatment $B_2$ in combination with the A treatments, etc. The
hypothesis to be tested is that the various population means coincide for each
of the given values of X.

Figure 11. Treatment means for three levels of B

As in testing an a priori hypothesis concerning a single trend, two separate 
tests are required. 

We are first concerned with the more specific hypothesis, $H_{5a}$, that the
various sets of A means follow the same “pattern,” in the sense that the curves
describing the population means are all “parallel” to one another, or that for
any two populations the difference in the criterion means is the same for each
value of X. This, of course, is equivalent to the hypothesis that there is no
interaction between A and B. This hypothesis is tested by

$$F = ms_{AB} / ms_w, \qquad df = (a - 1)(b - 1)/ab(n - 1) \tag{130}$$




in which a represents the number of A treatments (number of values of X), b 
represents the number of levels of the B factor, and n represents the number of 
subjects in each treatment-combination group. 

To test the significance of the difference for any particular pair of trends, a
similar test is employed, except that $ms_{AB}$ is computed only for the data for the
two levels of B involved.

The preceding test is concerned with the hypothesis ($H_{5a}$) that the true
(population) trends are “parallel” to one another, or that they follow the
same pattern, but it does not test the hypothesis that the trends coincide. If
$H_{5a}$ proves tenable, another aspect of $H_5$ may be tested by

$$F = ms_B / ms_w, \qquad df = (b - 1)/ab(n - 1). \tag{131}$$

This tests the hypothesis ($H_{5b}$) that the main effect of B is zero, or that for any
two levels of B the mean of the successive differences in population means is
equal to zero. However, $H_{5b}$ could be true even though $H_{5a}$ were false. If
either of these hypotheses must be rejected, the over-all hypothesis ($H_5$) must
be rejected also. The test of $H_{5a}$ would usually be applied first, since ordinarily
there would be relatively little interest in $H_{5b}$ if $H_{5a}$ (and hence $H_5$) were already
known to be false. However, one would still be interested in $H_{5a}$ even though
$H_{5b}$ were known to be false.
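
The two tests of (130) and (131) may be sketched as follows for an A X B
design with equal numbers of subjects in all cells. The example is an
illustration only; the array layout and the names used are assumptions of
this sketch and do not come from the text.

```python
import numpy as np

def trend_comparison_tests(y):
    """F ratios for parallelism and coincidence of the A-trends over levels of B.

    y : array of shape (a, b, n), scores by A treatment, B level, and subject
        (equal cell sizes assumed)
    """
    y = np.asarray(y, dtype=float)
    a, b, n = y.shape

    M = y.mean()
    M_A = y.mean(axis=(1, 2))          # means of the A treatments
    M_B = y.mean(axis=(0, 2))          # means of the B levels
    M_AB = y.mean(axis=2)              # cell means

    # Within-cells mean square, the error term for both tests.
    ss_w = float(((y - M_AB[:, :, None]) ** 2).sum())
    df_w = a * b * (n - 1)
    ms_w = ss_w / df_w

    # A X B interaction: departure of the trends from parallelism.
    inter = M_AB - M_A[:, None] - M_B[None, :] + M
    ms_AB = n * float((inter ** 2).sum()) / ((a - 1) * (b - 1))

    # Main effect of B: vertical separation of parallel trends.
    ms_B = a * n * float(((M_B - M) ** 2).sum()) / (b - 1)

    return {
        "F_parallel": ms_AB / ms_w, "df_parallel": ((a - 1) * (b - 1), df_w),
        "F_coincide": ms_B / ms_w, "df_coincide": (b - 1, df_w),
    }
```

In data whose cell means are exactly additive in A and B, the first F is
zero (the trends are parallel), while the second reflects only the constant
vertical distance between the trends.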


Designs Appropriate for Trend Comparisons 


Almost all of the factorial designs considered in the preceding chapters may
be employed in trend comparisons. For convenience in the discussion of the
use of these designs for this purpose, we will define certain terms to be
employed, as follows:


We will designate as a “trend factor” one in which the trend of criterion 
means with increasing amounts or at successive levels of the factor is to be 
observed. 

We will designate as a “control factor” one whose effect upon the specified 
trend is either to be observed or controlled. If, in an A X B design the 
trends in the A-means at the various levels of B are to be compared, A is the 
trend factor and B the control factor. If, in an A X B X L design, the A 
trends are to be compared at the various levels of B, and if L is introduced in 
order that its effect upon the trends may be equalized and the comparisons 
made more precise, L also will be called a control factor. It does not seem 
worth while to employ different terms to distinguish between these two 
possible purposes of the control factor, since the distinction has no bearing 
upon the selection of the design or upon the statistical analysis and tests of 
significance employed. We noted previously that it is sometimes difficult to 
decide whether a particular design is to be called a treatments X levels or a 
factorial design, and the difficulty in distinction here is of exactly the same 




character. A control factor may be introduced into a design both because it 
is desired to compare trends in some other factor at different levels of the 
given factor and in order to increase the precision of other comparisons, and 
neither purpose may predominate sharply over the other. 

We will designate as a “repeatable” factor one all levels of which may be 
administered to the same subject with results meaningful for the purposes of 
the experiment, and for which comparable criterion measures may be ob- 
tained for the subjects at all levels of that factor. The mixed designs of 
Chapter 14 may be used, of course, only if one or more of the factors in- 
volved is repeatable. 

We will designate as a “counterbalanced” factor one whose effect on one 
or more of the other factors is to be counterbalanced in the experiment. 
“Order” of administration and “form” of the criterion tests are prominent 
examples of “counterbalanced” factors. A counterbalanced factor is nearly 
always non-repeatable; otherwise it would be regarded as a control factor in 
a simpler design. 


Most trend studies involve only a single trend factor, but experiments may 
be designed in which trends in more than one factor are to be observed simul- 
taneously. With reference to a given trend factor, there may be one or several 
control factors, one or more of which may be introduced to observe its effect on 
the specified trend, and one or more of which may be introduced simply to 
increase the precision of the other comparisons. The design may also involve 
more than one counterbalanced factor, each of which may be counterbalanced 
with reference to one or more of the other factors. 

The selection of the design most appropriate for a given trend study depends 
obviously upon the numbers of trend, control, and counterbalanced factors 
involved, and upon which of these factors are repeatable; and subsequent 
suggestions for the selection of designs will be organized along these lines. In 
any particular situation, however, the selection of a design may be dictated 
also by factors of administrative convenience or expediency, but because of the 
great variety of possibilities of this type, no systematic provision can be made 
for them here. 

The following suggestions will be limited to the case in which there is only 
one trend factor, and the selection of designs will be considered for various 
combinations of control and counterbalanced factors. 


Only one control factor; both trend and control factors non-repeatable: In this
case only one design is possible, the A X B design. The tests of $H_{5a}$ and $H_{5b}$
for this design have already been considered. It is possible, of course, to
compare both the A trends for the various levels of B and the B trends for the
various levels of A, so that both factors may at the same time be trend factors and
control factors. Similar possibilities exist in all of the subsequent designs, but,
for the sake of simplicity of discussion, this possibility will not be specifically
considered again.




Only one control factor; trend factor repeatable, but control factor non-repeatable:
In this situation one might employ the Type I design, A representing the trend
factor and B the control factor. (However, administrative considerations may
dictate the use of the less precise A X B design.) The test of $H_{5a}$ is made by
$F = ms_{AB}/ms_{error(w)}$, and of $H_{5b}$ by $F = ms_B/ms_{error(b)}$. If tests of simple trends
are desired, the error term is $ms_{error(w)}$.

Only one control factor; trend factor non-repeatable, but control factor repeatable: 
Again the Type I design may be employed, in this case with B as the trend 
factor and A as the control factor. In this case, both aspects of $H_5$ may be
tested by precise tests, but if an F-test of a simple B-trend is desired, the error 
term must be the “within-treatments” mean square computed for the data 
from the given level of A only, for reasons given in the discussion of the Type I 
design on page 271. 

Only one control factor; both trend and control factor repeatable: In this situa- 
tion, the use of the A X B X S design will result in the maximum precision in 
all comparisons, but if the interest is only in the tests of $H_5$, the design offers
little advantage over the Type I design. In any event, administrative con- 
siderations may dictate the use of the less precise Type I or A X B designs, 
even though both factors are repeatable. 

Two control factors; all factors non-repeatable: In this case, the only possible 
design is the A X B X C (or A X B X L) design. Comparisons of A-trends
may be made, of course, both for the various levels of B and for the various 
levels of C, and also for the various BC combinations. If one wishes to test the 
hypothesis that the A-trends are parallel for all BC combinations, one would 
test the ABC interaction. The test of the ABC interaction provides an answer 
to the question “Does the effect of B on the A-trends differ for different levels 
of C, or, does the effect of C upon the A-trends differ for the various levels of 
B?" If, on the further assumption that the A-trends are parallel for all BC 
combinations, one wished to test the hypothesis that the A-trends coincide 
(for these combinations) one would test $H_{5b}$ by testing the BC interaction.

Two control factors, both non-repeatable; trend factor repeatable: In this case,
the Type III design may be used, with A as the trend factor.

Two control factors, one non-repeatable; trend factor non-repeatable: Again the
Type III design may be used, but with A as the repeatable control factor.

Two control factors; trend factor and one control factor repeatable: The Type 
VI design may be used, with C as the repeatable control factor.

Two repeatable control factors; trend factor non-repeatable: The Type VI de- 
sign may be used, with C as the trend factor. 

Two control factors; all factors repeatable: In this case the A X B X C X S
design will provide the maximum precision in all comparisons, but any of the 
preceding three-dimensional designs may be employed, depending on con- 
siderations of administrative expediency and the relative emphasis placed on 
the various specific purposes of the experiment. 




One repeatable control factor; trend factor non-repeatable; a third factor to be 
counterbalanced with reference to the control factor: The Type IV design may be 
used, with C as the non-repeatable (or repeatable) trend factor. 

Trend factor repeatable; one non-repeatable control factor, with a third factor to
be counterbalanced with reference to the trend factor: The Type IV design may be 
used, with C as the non-repeatable (or repeatable) control factor. 

Trend factor and (one) control factor repeatable, with a third factor to be counter- 
balanced with reference to both the trend and control factors: A Latin-square 
design similar to the Type II design may be employed. If we let A and B 
represent the trend and control factors, the square will have ab columns
corresponding to the various AB combinations, with ab rows corresponding to
the different equal groups of subjects needed, and with C as the Latin-square 
factor. This design again will be practicable only when ab is quite small. 


The preceding outline should provide for most of the trend studies that
might be made in educational and psychological research, although, of course, 
this outline could be extended indefinitely for still larger numbers of control 
and counterbalanced factors, using higher-dimensional and more complex 
designs, obtained by combining designs already considered. 


STUDY EXERCISES¹
1. Reconsider Exercise 1 of Chapter 3 (page 101). 


a) Plot the means of the treatment groups in the fashion of Figure 9 on 
page 342, letting X equal the area of the test circle. 

b) What did your computations in the original exercise reveal about the 
presence of trend? 

c) Let us suppose that you have no theoretical or a priori reason to expect 
any particular type of trend, and that your purpose in the experiment is 
primarily to secure an estimate of the true trend, rather than to test any 
particular a priori hypothesis. Obviously, however, you prefer a simple 
type of estimate to a complex one, so you begin by fitting a straight line 
to the data. What is the slope of the straight line of “best fit" to the 
observed treatment means? (See page 321.) On the chart constructed 
in (a), plot the point (X = 48.8, Y = 9.39) representing the general means
of X and Y, then draw a line with the specified slope through this point. 

d) You may wish to determine if this straight line constitutes a tenable 
hypothesis concerning the trend of the population means. In applying 
the test of this hypothesis, what risk are you willing to take of making a 
Type I error? Justify your choice of a level of significance for this test, 
pointing out the consequences of setting a very high or a very low level of 
significance. Is the linear hypothesis tenable? 


1 See second paragraph on page viii.




e) In the test of (d), may either the F itself, or the probability with which it 
would be exceeded if the hypothesis were true, be regarded as a measure 
of “goodness (or badness) of fit," in any absolute sense, of the line to 
the observed means? ...to the treatment population means? Explain, 
pointing out upon what the F depends, in addition to the absolute good- 
ness of fit of the line to the observed means. 


f) On the suppositions of (c), and considering the outcome of the test of (d), 
is there any point in testing any more complex hypotheses? Explain. 


g) On the chart of (a), draw the lines Y = .14X + 3.2 and Y = .087X + 4.4.
Do these lines constitute tenable hypotheses (at the 5% level) concerning 
the trend of the treatment population means? Explain. What does this 
imply about the power of the test of (d)? 


h) Suppose that before this experiment was performed, someone had sug- 
gested that the line of population means is described by Y = 2 + .00LX*. 
On the chart of (a), plot the points on this line for the values of X corre- 
sponding to the various “treatments,” and then draw a smooth freehand 
curve through these points. In light of the results of this experiment, 
does this line constitute a tenable hypothesis concerning the population 
means? Explain. Is the hypothesis tenable so far as form of the line 
alone is concerned? Explain. 


i) Suppose that the only purpose of this experiment was to test the hy- 
pothesis that the trend of population means is linear. Would the experi- 
ment then have been more efficient if observations had been made only 
for X = 20, 50, and 79, rather than for five values of X? Explain. 


j) What is the advantage of making observations at five values of X, rather
than at only three? (Consider the possible forms that the trend of treat- 
ment population means may take if the linear hypothesis is false.) 


2. Reconsider Exercise 2 of Chapter 6 (page 169). 


a) Plot the means of the treatment (J) groups. 


b) Compute the F needed to test the linear hypothesis as applied to the
corresponding population means. Is this hypothesis tenable (use the 5%
level)?


c) Is higher precision needed to reveal that the preceding hypothesis is false 
than was needed to reject the simple null hypothesis concerning the
treatment population means? [See question (i) on page 171.]


d) Describe how, given the necessary data, you would test the hypothesis 
that the population means for J, and I,, are equal. Does this test involve 
any assumption of homogeneity? If the outcome of this test were
significant, could one retain the hypothesis that the relation between intensity
of a stimulus and intensity of response is monotonic? Explain.


3. Reconsider Exercise 3 of Chapter 13 (page 311). 


a) Plot the over-all D means in the manner of Figure 9 on page 342. Plot 
the straight line of best fit to these means. (See Exercise 1c preceding.) 


b) Test (at the 5% level) the hypothesis that in the population the D
means (over-all lists) fall on a single straight line. What conclusion do
you draw from the results of this test?


c) When a line of the type Y = A + BX + CX² is fitted to these means, the
values of the coefficients are found to be A = 31.125, B = −7.05, and C
= 1.00. Plot the points on this line for X = 1, 2, 3, and 4 days, and draw
a smooth freehand curve through these points.


d) Compute the F needed to test the hypothesis that the population means
fall on the curve just plotted.


e) On a separate chart, plot the D means for each of the four lists, in the
manner of Figure 11 on page 350. Is the hypothesis tenable that
the corresponding lines based on population means are “parallel” to
one another?


f) Compute the F needed to test the linear hypothesis as applied to the D
means for List 1 alone. Do you accept this hypothesis? Explain.


g) What is the value of Y on the line of (c) when X = 10? The line fits the 
observed means fairly closely between the values of X = 1 and X = 4,
but is it a plausible description of the functional relationship between Y 
and X? Explain. What should be the characteristics of the line if it is to 
represent a plausible description of the functional relationship? 


4. Reconsider the experiment by Kalish described in Exercise 4 on page 313. 


a) Plot the T-trends for the four levels of A (see upper right-hand table 
on page 315) and compute the F needed to test the hypothesis that 
these trends are parallel for the population. 


b) The data in the left-hand table on page 315 describe the T-trends for the
16 treatment-combinations. Compute the F needed to test the hypothe- 
sis that these trends are parallel. 


c) Compute the F needed to test the hypothesis that the 16 trends of (b) 
coincide (on the assumption that they are parallel). 


d) Compute the F needed to test the hypotheses that the T-trends for AE; 
AE, A,Es, and AE; are parallel . . . that they coincide (on the
assumption they are parallel).


Estimation of Variance Components in Reliability Studies


Introduction 


In the preceding chapters, we have been concerned primarily with the use of
the methods of analysis of variance with particular experimental designs in the
testing of hypotheses concerning the treatment effects. There is, however,
another important application of these designs and analytical procedures, in
which the interest is in the estimation of population parameters rather than in
tests of significance. Among the most important applications of this type in
psychological and educational research are those concerned with the reliability
of the measures obtained.

The usual procedure in reliability studies in psychology and education has
been to describe the reliability of the obtained measures in terms of
“reliability coefficients.” This procedure is fairly adequate when there is only one
distinguishable source of random errors, as in computing the reliability coeffi- 
cient of a pencil-and-paper objective test by the “odds-evens items” method, 
although even in this case it is usually desirable to know the standard error of 
measurement as well as the reliability coefficient. The procedure is decidedly 
inadequate, however, when several sources of random error may be distin- 
guished, as is often the case in performance testing. Suppose, for example, 
that in a study of the reliability of measures of the ability to perform a certain 
task, each of a number of subjects is required to perform the task twice inde- 
pendently for each of three different observers — each observer recording an 
independent “proficiency rating” for each performance. The three observers
may be regarded as a random sample from a population of observers, the six
trials as a random sample from a population of trials, while the n subjects are a
random sample from a population of subjects.

In this situation, it is possible to compute a number of correlation
coefficients which may be described as “reliability coefficients.” For instance, one
could determine, for all subjects in the sample, the correlation between the 
two observations of a single observer, or one could correlate an observation for 
one observer with an observation by another observer. These reliability 
coefficients would have some descriptive value, but they would not lead di- 
rectly to a quantitative estimate of the relative importance of the various 
sources of variation, nor would they contain in themselves any specific sugges- 
tions for the construction of a measurement schedule. A much more useful 
and constructive approach would involve regarding a single obtained score as 
consisting of several independent components, and of securing estimates of the 
population variance of each component. In this case, for example, an obtained
score (X) may be regarded as consisting in part of a true score (t) which is the
average of an infinite number of observations of this subject by an infinite
number of observers, in part of a systematic bias (o) on the part of the
particular observer involved (a bias which is constant for all subjects), in part of a
variable bias (i) associated with that observer (variable from subject to
subject), due to the fact that he may define the task somewhat differently than
other observers, and in part of variations in the subject’s performance from
trial to trial plus a variable error of observation on the part of the observer.
We will let v represent the sum of these last two variations. Thus,

$$X = t + o + i + v$$
Presumably these components are independent of one another, so that the
variance of X for all subjects in the population is equal to the sum of the
population variances of the components:

$$\sigma_X^2 = \sigma_t^2 + \sigma_o^2 + \sigma_i^2 + \sigma_v^2$$

The “error variance” would then be the sum of the last three components.
That is,

$$\sigma_{error}^2 = \sigma_o^2 + \sigma_i^2 + \sigma_v^2$$


If we could secure unbiased estimates of these variance components, we 
would of course know directly what is the relative potency of the different 
sources of variation, and could utilize this information constructively in a 
number of different ways. For example, if we knew the “overhead” cost for a
single observer, as well as the additional per-observation cost for each ob- 
server, we could determine what number of observations per observer would 
result in the most efficient measurement at a given cost, or in the most reliable 
mean score per subject for a given expenditure. Again, we could determine 
what number of observers combined with what number of observations per 
observer would be required to secure a mean score whose standard error of 
measurement does not exceed a specified value. 

We shall see in this chapter how the methods of analysis of variance may be 
applied to secure unbiased estimates of variance components of this kind, and 
how this information may be employed to secure estimates of various reliability coefficients, as well as how it may be used constructively in further planning of measurement schedules. We shall see that in these applications the
usual F-tests of significance are of little or no interest, that the reliability 
coefficients obtained are of secondary interest only, and that the basic desired 
information consists of the estimated components of the total error variance. 

In this chapter it will not always be possible to provide derivations in terms 
that the student untrained in mathematics can understand, and hence some 
propositions will be presented without proof. 


The One-Dimensional Design 


Suppose that for a random sample of subjects from a specified population a 
number of measures (scores) of a certain trait have been obtained for each 
subject, and that a description is desired of the reliability of these measures. 
Where only two measures have been obtained for each subject, the usual 
procedure is to compute the correlation between the two scores for each subject 
and to regard this coefficient as the “coefficient of reliability” of the obtained 
scores. When several observations have been made of each subject, or several 
scores obtained, the procedure could be followed of computing the correlation 
for each of the possible pairs of scores and of using the average of these coeffi- 
cients to describe the reliability of the scores. A more convenient and satis- 
factory procedure, however, is provided by the methods of analysis of vari- 
ance. 

Suppose we have n scores for each of s subjects, or a total of ns observations. 
Using the methods of Chapter 3, we can then analyze the total sum of squares 
for these ns observations into its “between-subjects” and “within-subjects” 
components, and compute a mean square for each component. We shall let 
$M_i$ represent the mean of the n scores for the ith subject and $\mu_i$ the mean of an
infinite number of such scores for the subject. In the language of the theory of
mental measurement, $\mu_i$ is the “true score” for the subject. We shall let $\sigma_e^2$
represent the variance of an infinite number of obtained scores for the subject.
In the language of measurement theory, the square root ($\sigma_e$) of this variance is
the “standard error of measurement.” We shall assume that $\sigma_e^2$ is the same
for all subjects in the population. We shall let $\sigma_M^2$ represent the variance of
the $M$'s for all subjects in the population and $\sigma_\mu^2 = \sigma_{\mu_i}^2$ represent the variance
of the true scores for the entire population. We may then note that a single
obtained score may be written

$$X = \mu_i + (X - \mu_i)$$
or 
obt'd score = true score + error in obt'd score. 


The variance of a distribution consisting of one obtained score for each 
subject in the population would then be 


$$\sigma_X^2 = \sigma_{\mu_i}^2 + \sigma_{(X-\mu_i)}^2 = \sigma_\mu^2 + \sigma_e^2.$$




Similarly, the mean of n obtained scores for a single subject is 


$$M_i = \mu_i + (M_i - \mu_i)$$


and 


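$$\sigma_M^2 = \sigma_{\mu_i}^2 + \sigma_{(M_i-\mu_i)}^2 = \sigma_\mu^2 + \frac{\sigma_e^2}{n}.$$

From this it follows that

$$\sigma_\mu^2 = \sigma_M^2 - \frac{\sigma_e^2}{n}. \tag{145}$$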


Now we know (page 60) that the expected value of $ms_w$ is $\sigma_e^2$, or that $ms_w$ is
an unbiased estimate of $\sigma_e^2$. We also know that an unbiased estimate of $\sigma_M^2$ is
given by

$$\text{est'd } \sigma_M^2 = \frac{\sum_{i=1}^{s}(M_i - M)^2}{s-1}.$$

By substituting these unbiased estimates for the variances on the right of
(145), we can then secure an unbiased estimate of $\sigma_\mu^2$ as follows:

$$\text{est'd } \sigma_\mu^2 = \frac{\sum(M_i - M)^2}{s-1} - \frac{ms_w}{n} = \frac{\dfrac{n\sum(M_i - M)^2}{s-1} - ms_w}{n}.$$

But the first term of the numerator is $ms_S$. Hence,

$$\text{est'd } \sigma_\mu^2 = \frac{ms_S - ms_w}{n}. \tag{146}$$
This is consistent with our previous reasoning concerning the expected mean
squares in a simple analysis of variance. On page 62 we noted that the expected value of $ms_S$ is

$$E(ms_S) = \sigma_e^2 + \frac{n\sum(\mu_i - \mu)^2}{s-1}.$$

If we regard the $\mu_i$'s in our sample as a random sample from a population of
$\mu$'s, then $\sum(\mu_i - \mu)^2/(s-1)$ is an unbiased estimate of $\sigma_{\mu_i}^2 = \sigma_\mu^2$, so that we may write

$$E(ms_S) = \sigma_e^2 + n\sigma_\mu^2. \tag{147}$$

We noted earlier that

$$E(ms_w) = \sigma_e^2. \tag{148}$$




From the two preceding expressions, we then get

$$\sigma_\mu^2 = \frac{E(ms_S) - E(ms_w)}{n}.$$

Substituting $ms_S$ and $ms_w$ for their expected values, we again get

$$\text{est'd } \sigma_\mu^2 = \frac{ms_S - ms_w}{n}. \tag{149}$$
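The estimate in (146) and (149) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the list-of-lists layout, and the six scores are made up for the example and do not come from the text.

```python
def one_way_components(scores):
    """Return (ms_s, ms_w, est_true_var) for s subjects with n scores each.

    `scores` is a list of s equal-length lists; this layout is an
    illustrative assumption, not notation from the text.
    """
    s = len(scores)
    n = len(scores[0])
    subject_means = [sum(row) / n for row in scores]
    grand_mean = sum(subject_means) / s

    # Between-subjects mean square: n * sum (M_i - M)^2 / (s - 1)
    ms_s = n * sum((m - grand_mean) ** 2 for m in subject_means) / (s - 1)

    # Within-subjects mean square: squared deviations about each subject's
    # own mean, pooled over subjects, on s(n - 1) degrees of freedom
    ss_w = sum((x - m) ** 2 for row, m in zip(scores, subject_means) for x in row)
    ms_w = ss_w / (s * (n - 1))

    # Estimated true-score variance, following (146) and (149)
    est_true_var = (ms_s - ms_w) / n
    return ms_s, ms_w, est_true_var


# Made-up data: three subjects, two scores each
ms_s, ms_w, est_true_var = one_way_components([[1, 3], [4, 6], [8, 10]])
print(ms_s, ms_w, est_true_var)
```

With so few subjects the estimate is of course very unstable; the data serve only to make the arithmetic easy to follow.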


Now the “coefficient of reliability” of an obtained score for a specified
population may be defined as the ratio of the variance of the true scores to the
variance of the obtained scores for this population. That is, if we let $r_{11}$
represent the reliability coefficient,

$$r_{11} = \frac{\sigma_\mu^2}{\sigma_X^2} = \frac{\sigma_\mu^2}{\sigma_\mu^2 + \sigma_e^2}. \tag{150}$$

We can then obtain an estimate of $r_{11}$ by substituting for the variances in (150)
their estimated values, as follows:

$$\text{est'd } r_{11} = \frac{ms_S - ms_w}{ms_S + (n-1)\,ms_w}. \tag{151}$$


Thus, from the results of a simple analysis of variance of the ns obtained
scores for our sample, we can secure an estimate of the reliability coefficient of
the obtained scores for the specified population. This is what R. A. Fisher
describes as an “intraclass” correlation, the n “classes” in this case corresponding to the various sets of scores. We note from (145) that the population
variance of the mean of k scores for each individual is given by $\sigma_\mu^2 + \sigma_e^2/k$. Accordingly, the reliability of the mean (or sum) of k scores for each individual is
given by

$$r_{kk} = \frac{\sigma_\mu^2}{\sigma_\mu^2 + \dfrac{\sigma_e^2}{k}},$$

from which we derive

$$\text{est'd } r_{kk} = \frac{k(ms_S - ms_w)}{k\,ms_S + (n-k)\,ms_w}. \tag{152}$$


From (151) and (152) it is easy to derive the Spearman-Brown formula for
estimating the reliability of a lengthened test.
The reliability coefficient $r_{11}$ may also be estimated in terms of $F = ms_S/ms_w$.
Dividing the numerator and denominator of (151) by $ms_w$, we get

$$\text{est'd } r_{11} = \frac{F - 1}{F + n - 1}. \tag{153}$$
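The relations among (151), (152), (153) and the Spearman-Brown formula can be checked numerically. The sketch below is illustrative only: the function names and the mean squares (50 and 10, with n = 4) are hypothetical values chosen for easy arithmetic.

```python
def reliability_single(ms_s, ms_w, n):
    """Estimated reliability of a single score, as in (151)."""
    return (ms_s - ms_w) / (ms_s + (n - 1) * ms_w)


def reliability_of_mean(ms_s, ms_w, n, k):
    """Estimated reliability of the mean of k scores, as in (152)."""
    return k * (ms_s - ms_w) / (k * ms_s + (n - k) * ms_w)


def reliability_from_f(f, n):
    """The same single-score estimate written in terms of F, as in (153)."""
    return (f - 1) / (f + n - 1)


def spearman_brown(r11, k):
    """Spearman-Brown reliability of a measure lengthened k times."""
    return k * r11 / (1 + (k - 1) * r11)


# Hypothetical mean squares from n = 4 scores per subject
ms_s, ms_w, n = 50.0, 10.0, 4
r11 = reliability_single(ms_s, ms_w, n)
r44 = reliability_of_mean(ms_s, ms_w, n, k=4)
print(r11, reliability_from_f(ms_s / ms_w, n))
print(r44, spearman_brown(r11, 4))
```

Each printed pair should agree, since (153) is an algebraic restatement of (151) and the Spearman-Brown value is what (152) gives when it is expressed through the single-score reliability.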


It is possible to establish a confidence interval for r using the table for F, 
granting that a confidence interval consistent with the tabled values of F is 
selected. We do this by first establishing the confidence limits for F, and then 




substituting these in (153). To obtain the upper limit ($F_u$) of the (100 - 2X)%
confidence interval for the true F, we simply multiply the obtained value ($F_o$)
by the value of F at the X% point in the F-distribution, as read from Table 3,
page 41. That is, $F_u = F_o F_{X\%}$. The lower limit ($F_l$) of the confidence interval is given by $F_l = F_o/F_{X\%}$. For example, if the obtained F, for 5 and 14
degrees of freedom, is 11.7, the upper limit of the 90% confidence interval is
11.7 × 2.96 = 34.63, while the lower limit is 11.7/2.96 = 3.95.

Having thus established the limits of the confidence interval for F, these 
may be substituted in (153) to obtain the corresponding upper and lower 
limits for r. For the preceding example, these are .85 and .33, n being equal 
to 6. 
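The worked example above can be reproduced with a short sketch. The function names are illustrative assumptions; the numbers (obtained F of 11.7, tabled F of 2.96, n = 6) are those given in the text, and the procedure is the one described there rather than any other method of interval estimation.

```python
def r_from_f(f, n):
    """Reliability estimate in terms of F, as in (153)."""
    return (f - 1) / (f + n - 1)


def r_confidence_limits(f_obtained, f_tabled, n):
    """Limits for r obtained by substituting the limits for F in (153)."""
    f_upper = f_obtained * f_tabled   # upper limit of the interval for F
    f_lower = f_obtained / f_tabled   # lower limit of the interval for F
    return r_from_f(f_lower, n), r_from_f(f_upper, n)


lower, upper = r_confidence_limits(f_obtained=11.7, f_tabled=2.96, n=6)
print(round(lower, 2), round(upper, 2))  # 0.33 0.85
```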

Since the tabled values of F are restricted to X = 20, 10, 5, 2.5, 1.0, 0.5 
and .1, confidence limits for r can be conveniently established in this fashion 
only for the 60, 80, 90, 95, 98, 99 and 99.8% confidence intervals without 
resorting to interpolations between the table values of F, but these should 
suffice for nearly all practical purposes. 

For large values of s, the sampling distribution of est'd $\sigma_\mu^2$ is approximately
normal, and its estimated variance is

$$\text{est'd } \sigma^2_{\text{est'd } \sigma_\mu^2} = \frac{2}{n^2}\left(\frac{ms_S^2}{df_S} + \frac{ms_w^2}{df_w}\right).$$

Similarly, est'd $\sigma_e^2$ is normally distributed for large values of $df_w$, and its estimated sampling variance is

$$\text{est'd } \sigma^2_{\text{est'd } \sigma_e^2} = \frac{2\,ms_w^2}{df_w}.$$

Confidence limits for $\sigma_e^2$ and $\sigma_\mu^2$ can therefore be readily established if s is
large.
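A minimal sketch of such large-sample limits follows. It assumes the sampling variance of the estimated true-score variance takes the form (2/n²)(ms_S²/df_S + ms_w²/df_w); the function names, the mean squares, the sample size, and the normal deviate 1.96 (for roughly 95% limits) are all illustrative assumptions.

```python
import math


def true_var_sampling_variance(ms_s, ms_w, n, df_s, df_w):
    """Estimated sampling variance of the estimated true-score variance."""
    return (2 / n ** 2) * (ms_s ** 2 / df_s + ms_w ** 2 / df_w)


def approx_limits(estimate, sampling_variance, z=1.96):
    """Normal-approximation limits; only reasonable when s is large."""
    half_width = z * math.sqrt(sampling_variance)
    return estimate - half_width, estimate + half_width


# Hypothetical study: s = 101 subjects, n = 4 scores per subject
n, s = 4, 101
ms_s, ms_w = 50.0, 10.0
df_s, df_w = s - 1, s * (n - 1)

estimate = (ms_s - ms_w) / n
variance = true_var_sampling_variance(ms_s, ms_w, n, df_s, df_w)
low, high = approx_limits(estimate, variance)
print(estimate, variance, low, high)
```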


The Two-Dimensional Design 


The application of the two-dimensional design in reliability studies may 
best be explained in terms of a specific illustration. Suppose that a random 
sample of s subjects is drawn from a certain population, and that for each 
subject n independent observations or measurements of a certain trait are 
made by each of a observers. These a observers are regarded as a random 
sample from a population of observers. The ans observations or scores may 
then be entered in a double-entry table, in which columns correspond to 
observers ($A_1, A_2, \ldots, A_a$), rows to subjects ($S_1, S_2, \ldots, S_s$), with n observations or obtained scores in each cell. For each mean in this table there is a
corresponding “true” mean. The true mean $\mu_{ij}$ (corresponding to $M_{ij}$, a cell
mean) is the mean of an infinite number of observations of subject i by observer
j. This, in the language of mental measurement theory, is the “true score” for
subject i so far as observer j is concerned. We will call it an “observer true
score,” or an A true score. The true mean $\mu_{i.}$ (corresponding to $M_{i.}$, a row




mean) is the mean of the observer true scores for subject i for an infinite number of observers. This is the “true” true score for subject i. We will call it
simply his true score. The true mean $\mu_{.j}$ (corresponding to $M_{.j}$, a column
mean) is the mean of the observer true scores for observer j for all subjects in
the population. It is also the mean of the distribution consisting of a single
obtained score from observer j for each subject in the population. The true
score mean $\mu$ (corresponding to M, the general mean) is the mean of the true
scores for all subjects in the population.
We may now write the identity

$$(X - \mu) = \underbrace{(X - \mu_{ij})}_{e} + \underbrace{(\mu_{.j} - \mu)}_{\alpha} + \underbrace{(\mu_{i.} - \mu)}_{\gamma} + \underbrace{(\mu_{ij} - \mu_{i.} - \mu_{.j} + \mu)}_{\alpha\gamma}. \tag{154}$$


In this expression, $e = (X - \mu_{ij})$ represents the error of observation or of
measurement which is specific to observer j and subject i. Its variance is $\sigma_e^2$,
which we will assume is the same for all subjects and all observers. The term
$\gamma = (\mu_{i.} - \mu)$ represents the deviation of the subject's true score from the
population mean. Its variance, which we will represent by $\sigma_\gamma^2$, is the population variance of the true scores, and may be called the “trait” variance for
the population. The term $\alpha = (\mu_{.j} - \mu)$ represents the “general bias” of
observer j. We will let its variance be represented by $\sigma_\alpha^2$. The term $\alpha\gamma = (\mu_{ij} - \mu_{i.} - \mu_{.j} + \mu)$ is the true interaction effect for subject i and observer j. It
may be written $[(\mu_{ij} - \mu_{i.}) - (\mu_{.j} - \mu)]$. The term $(\mu_{ij} - \mu_{i.})$ represents the
total bias in observer j's observations of subject i. The difference between
$(\mu_{ij} - \mu_{i.})$ and $(\mu_{.j} - \mu)$ represents the difference between observer j's total
bias for subject i and his general bias, and may be called his specific bias with
reference to subject i. The population variance of these specific biases, that
is, the population variance of the true interaction effects, we will represent by
$\sigma_{\alpha\gamma}^2$. We will assume that it is the same for all observers. Thus, (154) indicates that the deviation of an individual's obtained score from the population
mean consists in part of a deviation of his true score from the population mean,
in part of a specific observer bias, in part of a general observer bias, and in part
of an error of measurement or of observation.

Following the methods of Chapter 6, we may analyze the total sum of 
squares among the ans observations in the sample into its between-subjects 
(S), between-observers (A), subjects × observers (AS), and within (w) components, and may compute a mean square for each component. The expected
values¹ of these mean squares are as follows:


¹ Suppose that corrections are applied in the double-entry table so as to eliminate
$ms_S$ and $ms_A$. This twice-corrected table may then be regarded as representing a simple-randomized design, in which, according to the reasoning underlying (147),

$$E(ms_{AS}) = \sigma_e^2 + n\sigma_{\alpha\gamma}^2.$$

Proofs of (155) and (156) will be supplied later (footnote on page 367).




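$$E(ms_A) = \sigma_e^2 + n\sigma_{\alpha\gamma}^2 + sn\sigma_\alpha^2 \tag{155}$$
$$E(ms_S) = \sigma_e^2 + n\sigma_{\alpha\gamma}^2 + an\sigma_\gamma^2 \tag{156}$$
$$E(ms_{AS}) = \sigma_e^2 + n\sigma_{\alpha\gamma}^2 \tag{157}$$
$$E(ms_w) = \sigma_e^2 \tag{158}$$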


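Hence, we can secure estimates of each of the variance components as follows:

$$\begin{aligned}
\text{est'd } \sigma_\alpha^2 &= \frac{ms_A - ms_{AS}}{sn} \\
\text{est'd } \sigma_\gamma^2 &= \frac{ms_S - ms_{AS}}{an} \\
\text{est'd } \sigma_{\alpha\gamma}^2 &= \frac{ms_{AS} - ms_w}{n} \\
\text{est'd } \sigma_e^2 &= ms_w.
\end{aligned} \tag{159}$$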
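The four estimates in (159) follow directly from the expected mean squares (155) to (158), and the arithmetic can be sketched as below. The function name, the dictionary keys, and the mean squares (for a = 4 observers, s = 60 subjects, n = 2 observations per cell) are hypothetical values chosen for illustration.

```python
def two_way_components(ms_a, ms_s, ms_as, ms_w, a, s, n):
    """Variance-component estimates for the subjects-by-observers design."""
    return {
        "alpha": (ms_a - ms_as) / (s * n),    # general observer bias
        "gamma": (ms_s - ms_as) / (a * n),    # trait (true-score) variance
        "alpha_gamma": (ms_as - ms_w) / n,    # specific observer bias
        "e": ms_w,                            # error of observation
    }


# Hypothetical mean squares
components = two_way_components(
    ms_a=32.0, ms_s=70.0, ms_as=30.0, ms_w=20.0, a=4, s=60, n=2
)
print(components)
```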


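Estimates of the sampling variances of these variance estimates are given by

$$\begin{aligned}
\text{est'd } \sigma^2_{\text{est'd } \sigma_\alpha^2} &= \frac{2}{n^2 s^2}\left(\frac{ms_A^2}{df_A} + \frac{ms_{AS}^2}{df_{AS}}\right) \\
\text{est'd } \sigma^2_{\text{est'd } \sigma_\gamma^2} &= \frac{2}{n^2 a^2}\left(\frac{ms_S^2}{df_S} + \frac{ms_{AS}^2}{df_{AS}}\right) \\
\text{est'd } \sigma^2_{\text{est'd } \sigma_{\alpha\gamma}^2} &= \frac{2}{n^2}\left(\frac{ms_{AS}^2}{df_{AS}} + \frac{ms_w^2}{df_w}\right) \\
\text{est'd } \sigma^2_{\text{est'd } \sigma_e^2} &= \frac{2\,ms_w^2}{df_w}.
\end{aligned} \tag{160}$$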


When the degrees of freedom involved are large, the variance estimates are 
approximately normally distributed, and approximate confidence limits can be 
established in the ordinary way. The use of these estimation techniques in 
reliability analysis is definitely not recommended when s is small. 

From these variance estimates (159), we may infer a number of “reliability
coefficients” for the entire population of subjects. One of the most significant
of these is the correlation between the mean of n' measures in each of a' categories of A, and the mean of n' independent measures in each of these same
A-categories. One of these means (for subject i) may be represented by

$$M_{iI} = \frac{1}{a'}\sum_{j=1}^{a'} M_{ij},$$

and the correlation by $r_{M_{iI}M_{iI}}$. In general, in the notation of
this chapter, a Roman numeral in the subscript will indicate that a specified
number of categories of the corresponding factors have been averaged. In this
case, the I in $M_{iI}$ means that for subject (i) the mean of a' categories of the A
classification has been obtained, whereas the 1 in $M_{i1}$ would refer to a single
A-category ($A_1$).




The reliability coefficient for any “obtained” score for a specified population is defined as the ratio of the population variance of the corresponding true
scores to the population variance of the obtained scores. In the case of $r_{M_{iI}M_{iI}}$,
the obtained score is $M_{iI}$. The corresponding population mean is $\mu_{.I} = \frac{1}{a'}\sum_{j=1}^{a'}\mu_{.j}$.
To obtain an expression for the population variance of $M_{iI}$, or of

$$(M_{iI} - \mu_{.I}) = \frac{1}{a'}\sum_{j=1}^{a'}(M_{ij} - \mu_{.j}),$$

we write the identity

$$\frac{1}{a'}\sum_{j=1}^{a'}(M_{ij} - \mu_{.j}) = \frac{1}{a'}\sum_{j=1}^{a'}(M_{ij} - \mu_{ij}) + \frac{1}{a'}\sum_{j=1}^{a'}(\mu_{ij} - \mu_{i.} - \mu_{.j} + \mu) + \frac{1}{a'}\sum_{j=1}^{a'}(\mu_{i.} - \mu). \tag{161}$$


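Since the right-hand terms of (161) are independent, the variance of the left-hand term, or of $M_{iI}$, is equal to the sum of the variances of the right-hand
terms. The variance of $(M_{ij} - \mu_{ij})$ is $\sigma_e^2/n'$. Hence, the variance of the first
right-hand term is

$$\frac{1}{a'^2}\cdot a' \cdot \frac{\sigma_e^2}{n'} = \frac{\sigma_e^2}{a'n'}.$$

The variance of $(\mu_{ij} - \mu_{i.} - \mu_{.j} + \mu)$ is $\sigma_{\alpha\gamma}^2$; hence, the variance of the second
right-hand term is $\sigma_{\alpha\gamma}^2/a'$. We next note that

$$\frac{1}{a'}\sum_{j=1}^{a'}(\mu_{i.} - \mu) = (\mu_{i.} - \mu),$$

since $(\mu_{i.} - \mu)$ is the same for all values of j. The variance of $(\mu_{i.} - \mu)$, or of
the third right-hand term of (161), is $\sigma_\gamma^2$. Hence, the variance of $M_{iI}$ is

$$\sigma_{M_{iI}}^2 = \frac{\sigma_e^2}{a'n'} + \frac{\sigma_{\alpha\gamma}^2}{a'} + \sigma_\gamma^2. \tag{162}$$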


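The variance of the corresponding true scores is found by letting n' = ∞ in
(162). Hence, the variance of the true scores is $\sigma_{\alpha\gamma}^2/a' + \sigma_\gamma^2$. Accordingly,

$$r_{M_{iI}M_{iI}} = \frac{\dfrac{\sigma_{\alpha\gamma}^2}{a'} + \sigma_\gamma^2}{\dfrac{\sigma_e^2}{a'n'} + \dfrac{\sigma_{\alpha\gamma}^2}{a'} + \sigma_\gamma^2}. \tag{163}$$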




To estimate $r_{M_{iI}M_{iI}}$, we substitute for the variances in (163) the estimates of
them (159) secured from the sample. Thus,

$$\text{est'd } r_{M_{iI}M_{iI}} = \frac{\dfrac{ms_{AS} - ms_w}{a'n} + \dfrac{ms_S - ms_{AS}}{an}}{\dfrac{ms_w}{a'n'} + \dfrac{ms_{AS} - ms_w}{a'n} + \dfrac{ms_S - ms_{AS}}{an}}. \tag{164}$$


From (164) we can secure estimates of a number of more specific correlation 
coefficients. For example, if we wish to estimate the correlation between two 
observations of each subject by a single observer, we let n' = 1 and a’ = 1 in 
(164). If we wish to estimate the correlation, for all subjects in the population, 
between the means of two groups of n' observations by the same observer, we 
let a' = 1 in (164). If, for a particular group of a' observers, we want the self-correlation of the mean of a single observation from each observer, we let
n' = 1 in (164). This latter result would be more meaningful in the case in
which A represents “items” in a test, in which case, by letting n' = 1 in (164),
we secure the “test-retest” reliability coefficient of the mean score on the test.
If we wish the self-correlation of the means obtained in our sample, we let
a' = a and n' = n in (164). In the latter case, (164) reduces to

$$\text{est'd } r_{M_{iI}M_{iI}} = \frac{ms_S - ms_w}{ms_S} = 1 - \frac{1}{F_1},$$

in which $F_1 = ms_S/ms_w$. In this case, a confidence interval for $r_{M_{iI}M_{iI}}$ may be
readily established by the method described on pages 361-362.

Another correlation coefficient of considerable general significance is that
between the mean of n' observations in each of a' categories of A, and the mean
of n' independent observations in each of a' independent categories of A. One
of these means (that for subject i) may be represented by

$$M_{iI} = \frac{1}{a'}\sum_{j=1}^{a'} M_{ij},$$

and the second by

$$M_{iII} = \frac{1}{a'}\sum_{j=a'+1}^{2a'} M_{ij}.$$

The correlation coefficient is then represented by $r_{M_{iI}M_{iII}}$.
To estimate this reliability coefficient, we must first find an expression for
the variance of $M_{iI}$. We first note that the corresponding true score is

$$\frac{1}{a'}\sum_{j=1}^{a'}\mu_{ij} \;[\text{when } a' = \infty] = \mu_{i.}.$$

What is needed, then, is the variance of $M_{iI}$ about $\mu$, or of

$$(M_{iI} - \mu) = \frac{1}{a'}\sum_{j=1}^{a'}(M_{ij} - \mu).$$




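We must first write the identity

$$\frac{1}{a'}\sum_{j=1}^{a'}(M_{ij} - \mu) = \frac{1}{a'}\sum_{j=1}^{a'}(M_{ij} - \mu_{ij}) + \frac{1}{a'}\sum_{j=1}^{a'}(\mu_{ij} - \mu_{i.} - \mu_{.j} + \mu) + \frac{1}{a'}\sum_{j=1}^{a'}(\mu_{i.} - \mu) + \frac{1}{a'}\sum_{j=1}^{a'}(\mu_{.j} - \mu). \tag{165}$$

Since the right-hand terms of (165) are independent, the variance of the left term, or of $M_{iI}$, is equal to the sum of the variances of these
terms. The variance of the first right-hand term is $\sigma_e^2/(a'n')$, of the second is $\sigma_{\alpha\gamma}^2/a'$,
of the third is $\sigma_\gamma^2$, and of the last is zero, since $(\mu_{.j} - \mu)$ is the same for all subjects. Accordingly,¹

$$\sigma_{M_{iI}}^2 = \frac{\sigma_e^2}{a'n'} + \frac{\sigma_{\alpha\gamma}^2}{a'} + \sigma_\gamma^2. \tag{166}$$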


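The variance of the corresponding true scores is found by letting n' = ∞ and
a' = ∞ in (166), and the result is $\sigma_\gamma^2$. Thus,

$$r_{M_{iI}M_{iII}} = \frac{\sigma_\gamma^2}{\dfrac{\sigma_e^2}{a'n'} + \dfrac{\sigma_{\alpha\gamma}^2}{a'} + \sigma_\gamma^2}. \tag{167}$$

To estimate $r_{M_{iI}M_{iII}}$ we substitute from (159) in (167) and secure

$$\text{est'd } r_{M_{iI}M_{iII}} = \frac{\dfrac{ms_S - ms_{AS}}{an}}{\dfrac{ms_w}{a'n'} + \dfrac{ms_{AS} - ms_w}{a'n} + \dfrac{ms_S - ms_{AS}}{an}}. \tag{168}$$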


Again we can secure several more specific reliability coefficients from (168) 
by letting a’ and n’ take different values. If we want the correlation of the 


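¹ To prove (156), we note that when a' = a and n' = n, (166) becomes

$$an\,\sigma_{M_i}^2 = an\,\sigma_\gamma^2 + n\,\sigma_{\alpha\gamma}^2 + \sigma_e^2.$$

But

$$an\,\sigma_{M_i}^2 = an \cdot E\left[\frac{\sum(M_i - M)^2}{s-1}\right] = E\left[\frac{an\sum(M_i - M)^2}{s-1}\right] = E(ms_S).$$

Thus,

$$E(ms_S) = an\,\sigma_\gamma^2 + n\,\sigma_{\alpha\gamma}^2 + \sigma_e^2.$$

A similar proof can be provided for (155).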




mean of n' observations by one observer with the mean of n' observations
by another observer, $r_{M_{i1}M_{i2}}$, we let a' = 1 in (168). If we want the correlation
of the mean of one observation from each of a' observers with the mean of
a second observation from each of the same observers, we let n' = 1 in (168).
(Again, this correlation is more meaningful when A represents “items” in a
homogeneous test. In this case, by letting n' = 1 in (168), we secure the
“equivalent forms” reliability coefficient of the mean score on the test.) If
we want the correlation of a single observation by one observer with a single
observation by another observer, $r_{X_{i1}X_{i2}}$, we let a' = 1 and n' = 1 in (168).
If we want the reliability of the $M_i$'s obtained in our sample, we let a' = a
and n' = n in (168), which in this case reduces to

$$\text{est'd } r_{M_{iI}M_{iII}} = 1 - \frac{1}{F_2},$$

in which $F_2 = ms_S/ms_{AS}$. Again, a confidence interval for this r may be established by the method described on pages 361-362.
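The two estimated coefficients, (164) for the same A-categories and (168) for independent A-categories, differ only in whether the interaction component counts as true-score variance. The sketch below makes that explicit and checks the special cases a' = a, n' = n. The function names and the mean squares are hypothetical.

```python
def _components(ms_s, ms_as, ms_w, a, n):
    """Trait, interaction, and error components estimated as in (159)."""
    gamma = (ms_s - ms_as) / (a * n)
    alpha_gamma = (ms_as - ms_w) / n
    return gamma, alpha_gamma, ms_w


def est_r_same_categories(ms_s, ms_as, ms_w, a, n, a_prime, n_prime):
    """Estimate (164): both means use the same a' categories of A."""
    gamma, alpha_gamma, e = _components(ms_s, ms_as, ms_w, a, n)
    true_var = alpha_gamma / a_prime + gamma
    return true_var / (e / (a_prime * n_prime) + true_var)


def est_r_independent_categories(ms_s, ms_as, ms_w, a, n, a_prime, n_prime):
    """Estimate (168): the two means use independent sets of a' categories."""
    gamma, alpha_gamma, e = _components(ms_s, ms_as, ms_w, a, n)
    obtained_var = e / (a_prime * n_prime) + alpha_gamma / a_prime + gamma
    return gamma / obtained_var


# Hypothetical mean squares with a = 4 categories and n = 2 observations
ms_s, ms_as, ms_w, a, n = 70.0, 30.0, 20.0, 4, 2
r_same = est_r_same_categories(ms_s, ms_as, ms_w, a, n, a_prime=a, n_prime=n)
r_indep = est_r_independent_categories(ms_s, ms_as, ms_w, a, n, a_prime=a, n_prime=n)
print(r_same, 1 - ms_w / ms_s)    # both equal 1 - 1/F1
print(r_indep, 1 - ms_as / ms_s)  # both equal 1 - 1/F2
```

Setting `a_prime=1, n_prime=1` in the second function gives the estimated correlation between single observations by two different observers.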

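If we want the reliability of an observer (A) true score, we let a' = 1 and
n' = ∞ in (167), which then reduces to

$$r_{\mu_{i1}\mu_{i2}} = \frac{\sigma_\gamma^2}{\sigma_\gamma^2 + \sigma_{\alpha\gamma}^2},$$

and the estimated value of which is

$$\text{est'd } r_{\mu_{i1}\mu_{i2}} = \frac{\dfrac{ms_S - ms_{AS}}{an}}{\dfrac{ms_S - ms_{AS}}{an} + \dfrac{ms_{AS} - ms_w}{n}} = \frac{ms_S - ms_{AS}}{ms_S + (a-1)\,ms_{AS} - a \cdot ms_w}.$$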


This may be regarded as an estimate of $r_{M_{i1}M_{i2}}$ corrected for attenuation
due to variations in the observations of a single subject by a single observer.
It is an estimate of the maximum reliability that can be secured in a mean
score based on observations from a single observer. No matter how many
observations are made of each subject by a single observer, the reliability of
the mean of these estimates cannot exceed $r_{\mu_{i1}\mu_{i2}}$.

If there is no intrinsic AS interaction ($\sigma_{\alpha\gamma}^2 = 0$), $r_{\mu_{i1}\mu_{i2}}$ will equal 1.00.
In the specific application here considered, in which the A categories correspond to observers, an $r_{\mu_{i1}\mu_{i2}}$ of less than 1.00, or an AS interaction other
than zero, would imply that different observers are not really observing
the same trait, or that each defines the trait differently than the other observers. In many applications of this design, one would expect $r_{\mu_{i1}\mu_{i2}}$ to be
unity, but in some applications, as we shall see later, an intrinsic AS interaction may be quite plausible.

The discussion has thus far been based on a specific illustration in which 
$A_1, A_2, \ldots, A_a$ correspond to different observers and in which $X_{ij}$ represents
a single observation or measurement by a single observer. There are many 




other interesting applications of this design, only a few of which can be 
suggested here. Suppose, for example, that in a study of the measurement 
of writing ability, each of a number of college freshmen is requested to write 
three different themes on comparable subjects, and each theme is read twice 
independently by the same reader. In this case, $A_1$, $A_2$, and $A_3$ represent the
three themes, and $X_{i2}$ represents a single rating of theme 2 for subject i.
The coefficient $r_{X_{i1}X_{i1}}$ between the two ratings of the same theme may then be
regarded as an index of the reliability of the reader, or might be described
as the “objectivity” coefficient. The coefficient $r_{X_{i1}X_{i2}}$ would then represent
the reliability of a single rating of a single theme for the particular reader
involved, but would not take into consideration possible biases specific to
individual subjects on the part of this reader. In this case, $r_{\mu_{i1}\mu_{i2}}$ would
represent the maximum reliability that could be achieved in a score based
on a single theme by averaging a number of independent ratings by the
same reader. Since the same subject may reveal different aspects of his
general writing ability in different themes, one would expect $r_{\mu_{i1}\mu_{i2}}$ to be
less than unity: that is, one would expect an intrinsic AS interaction. Each
theme might be regarded as representing a sample of the subject's writing
performance, and $\sigma_{\alpha\gamma}^2$ is a measure of the variation in these samples from theme
to theme.

Suppose, as another example, that each subject writes only one theme, 
but that the theme is read twice independently by each of two readers. In 
this case, we can compute a coefficient of reliability of a single rating of a 
single theme by a single reader ($r_{X_{i1}X_{i2}}$) which does take into consideration
the effect of biases specific to individual subjects on the part of a single
reader, but which does not take into consideration the effect of variations
in the subjects' performance from theme to theme. If each subject wrote
four themes, two of which were read by one reader and two by another, we
would have a situation like the original illustration, in which $r_{X_{i1}X_{i2}}$ would
take all important sources of error into consideration (the specific bias of
the reader, the variations from theme to theme, and the “reading” error),
but would not permit an independent estimate of the objectivity coefficient 
(the reader reliability), or of the component of the total variance associated 
with the reading error. 

If $A_1, A_2, \ldots, A_a$ represent the items in a test, and if a single entry in each
cell of the table represents the score made on a single item by a single subject,
then $r_{M_{iI}M_{iII}}$ (with a' = a and n' = n = 1) is equal to

$$r_{M_{iI}M_{iII}} = 1 - \frac{1}{F_2},$$

in which $F_2 = ms_S/ms_{AS}$. Since the correlation between the sums of a number of variates is the same as the correlation between their means, $r_{M_{iI}M_{iII}}$
is also the reliability of the total score on the test of a items. This procedure¹

¹ C. Hoyt, “Tests of Reliability Obtained by the Analysis of Variance,” Psychometrika, vol. 6 (1941), pp. 153-160.




for computing the reliability of a test has been known as the “Hoyt pro- 
cedure.” A confidence interval may be established for the true r in the 
manner already suggested. 

In other applications, $A_1, \ldots, A_a$ might represent “equivalent forms” of a
test of considerable length, so that variations from form to form due to
sampling of content are negligible. These forms might be administered to the
subjects on different days, in which case $\sigma_{\alpha\gamma}^2$ would be a measure of “diurnal
variations” in the subject. If the forms were very short, and diurnal variations were negligible, $\sigma_{\alpha\gamma}^2$ would represent variations in the forms, rather
than in the subjects, that is, variations due to the sampling of content.

Although the reliability coefficients just considered are valuable for de- 
scriptive purposes, our major interest, for more constructive purposes, is in 
the estimates of the separate variance components. We may wish to answer 
such questions as “How many observations of a given trait must be made by 
each of five observers in order that the reliability of the mean of all observations for a subject may not be less than .90?” or “For a given total cost, what
values of a’ and n’ will yield the maximum reliability in the mean of all 
observations for each subject, or will result in the smallest error variance?” 
To answer such questions, we must first analyze the results for a random 
sample of subjects from the population involved, in order to secure estimates 
of the variance components. From these estimates we can then determine 
the error variance for any specified values of a’ and n’, and can estimate 
the corresponding reliability coefficients. Suppose, for example, that we have 
secured two observations on each of 60 subjects from each of four observers. 
Suppose that an analysis of variance of the results yields 


$ms_A$ = 32.00
$ms_S$ = 70.00
$ms_{AS}$ = 30.00
$ms_w$ = 20.00.
From these
est'd $\sigma_\gamma^2$ = 5.00
est'd $\sigma_{\alpha\gamma}^2$ = 5.00
est'd $\sigma_e^2$ = 20.00.


Suppose that we wish to secure measures of the given traits for a group of 
200 subjects, and that we have just $1,000 to spend for this purpose. We find 
that for this amount we can have each of eight observers make one observa- 
tion of each subject. Suppose, also, that for the same amount we could have 
each of five observers make two observations per subject, or each of four 
observers make three observations per subject, or each of three observers 
make five observations per subject. The question is “Which of these combi- 
nations will result in the smallest error variance in the obtained mean scores?” 




We know that the error variance ($\sigma_{\text{error}}^2$) of the mean of n' observations by each
of a' observers is given by

$$\sigma_{\text{error}}^2 = \frac{\sigma_{\alpha\gamma}^2}{a'} + \frac{\sigma_e^2}{a'n'}.$$

Accordingly, we can construct a table as follows:

 a'    n'    a'n'    $\sigma_{\alpha\gamma}^2/a' + \sigma_e^2/(a'n') = \sigma_{\text{error}}^2$
  8     1      8     5/8 + 20/8  = 3.12
  5     2     10     5/5 + 20/10 = 3.00
  4     3     12     5/4 + 20/12 = 2.92
  3     5     15     5/3 + 20/15 = 3.00
  2    10     20     5/2 + 20/20 = 3.50


from which it is evident that the most efficient procedure is to secure three 
observations of each subject from each of four observers. If this procedure 
is followed, the estimated reliability of the mean of the twelve observations 
for each subject is 
$$\text{est'd } r_{M_{iI}M_{iII}} = \frac{\sigma_\gamma^2}{\sigma_\gamma^2 + 2.92} = \frac{5}{5 + 2.92} = .63.$$

If we then wish to know what would be the cost of obtaining a mean score 
whose reliability is .90, (assuming that an increase in the number of observers 
is accompanied by a proportional increase in cost), we would substitute the 
known values in (167) and solve for a’ to secure a’ = 21 (rounded), which 
would mean that the estimated total cost would be (21/4) × $1000 = $5250.00.
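The planning arithmetic of this illustration can be reproduced with a short sketch. The component estimates (5, 5, and 20), the candidate plans, and the target reliability of .90 are taken from the illustration; the function and variable names are assumptions made for the example.

```python
# Estimated components from the illustration
var_gamma, var_alpha_gamma, var_e = 5.0, 5.0, 20.0


def error_variance(a_prime, n_prime):
    """Error variance of the mean of n' observations by each of a' observers."""
    return var_alpha_gamma / a_prime + var_e / (a_prime * n_prime)


def reliability(a_prime, n_prime):
    """Reliability of that mean, as in (167)."""
    return var_gamma / (var_gamma + error_variance(a_prime, n_prime))


def observers_needed(target_r, n_prime):
    """Solve (167) for a' at a fixed n' and a target reliability."""
    allowed_error = var_gamma * (1 - target_r) / target_r
    return (var_alpha_gamma + var_e / n_prime) / allowed_error


# Candidate (a', n') plans listed in the table
plans = [(8, 1), (5, 2), (4, 3), (3, 5), (2, 10)]
for a_prime, n_prime in plans:
    print(a_prime, n_prime, round(error_variance(a_prime, n_prime), 2))

best = min(plans, key=lambda plan: error_variance(*plan))
a_needed = round(observers_needed(0.90, n_prime=3))
cost = a_needed / 4 * 1000  # cost assumed proportional to the number of observers
print(best, round(reliability(*best), 2), a_needed, cost)
```

The search over a fixed list of plans mirrors the table; a finer search would need the actual cost function, which the illustration does not give.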

It will be apparent from this illustration that negative estimates of some 
of the variance components might sometimes be secured. For example, if 
in this illustration, $ms_{AS}$ had been, say, 18.00, the estimated value of $\sigma_{\alpha\gamma}^2$
would have been -1.00. Since negative variances are impossible, an estimate
of zero must be substituted for the negative estimates in such cases. In
such cases, biased estimates of the variance components concerned will result. This is another reason
why these techniques should not be employed in reliability studies when 
the degrees of freedom involved in the variance estimate are small. With 
large numbers of degrees of freedom, negative and biased estimates are less 
likely to occur, and, if they do occur, the bias introduced is less likely to be 
serious. This is equivalent to saying that these techniques should be em- 
ployed in reliability studies only when s is large. In a case like that just 
considered, it would sometimes not matter if a were small, since $ms_A$ might
not be used in estimating any of the variance components in which we are
interested. The component $\sigma_\alpha^2$ is usually of little interest since, in most
applications, the same A-categories (for example, observers or equivalent 
forms of a test) are used with all observations, so that A is only a source of 
constant bias rather than of variable error. 
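The rule of substituting zero for a negative estimate can be sketched as follows, using the case mentioned above in which the AS mean square is 18.00 and the within mean square is 20.00 with n = 2. The function name is an assumption made for the example.

```python
def component_estimate(ms_upper, ms_lower, divisor):
    """Return the raw estimate and the estimate after replacing negatives by zero."""
    raw = (ms_upper - ms_lower) / divisor
    return raw, max(raw, 0.0)


# ms_AS = 18.00, ms_w = 20.00, n = 2
raw, used = component_estimate(18.0, 20.0, 2)
print(raw, used)  # -1.0 0.0
```

Keeping the raw value alongside the truncated one makes it easy to see how often, and by how much, the truncation is being applied.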




The Three-Dimensional Design 


We shall first discuss the three-dimensional design in general terms, and 
later consider some illustrative applications in more specific terms. In all
applications, both A and B are presumably random effects. That is, the
A-categories in the sample are presumably a random sample from a popula- 
tion consisting of an infinite number of A-categories, and similarly for B. 
The AB combinations (cells) in the sample are therefore also a random sample 
from a population of AB combinations. A constant number (n) of observa- 
tions or scores is obtained for each of s subjects for each of a categories in the 
A classification within each of b categories in the B classification. Assuming 
that $\sigma_e^2$ is the same for all subjects, and that the interaction variances are also
homogeneous, the mean squares obtained from an analysis of variance of the 
absn observations, and their expected values are as follows: 


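$$\begin{array}{ll}
\text{Mean Squares} & \text{Expected Values} \\
ms_A & \sigma_e^2 + n\sigma_{\alpha\beta\gamma}^2 + sn\sigma_{\alpha\beta}^2 + bn\sigma_{\alpha\gamma}^2 + bsn\sigma_\alpha^2 \\
ms_B & \sigma_e^2 + n\sigma_{\alpha\beta\gamma}^2 + sn\sigma_{\alpha\beta}^2 + an\sigma_{\beta\gamma}^2 + asn\sigma_\beta^2 \\
ms_S & \sigma_e^2 + n\sigma_{\alpha\beta\gamma}^2 + bn\sigma_{\alpha\gamma}^2 + an\sigma_{\beta\gamma}^2 + abn\sigma_\gamma^2 \\
ms_{AB} & \sigma_e^2 + n\sigma_{\alpha\beta\gamma}^2 + sn\sigma_{\alpha\beta}^2 \\
ms_{AS} & \sigma_e^2 + n\sigma_{\alpha\beta\gamma}^2 + bn\sigma_{\alpha\gamma}^2 \\
ms_{BS} & \sigma_e^2 + n\sigma_{\alpha\beta\gamma}^2 + an\sigma_{\beta\gamma}^2 \\
ms_{ABS} & \sigma_e^2 + n\sigma_{\alpha\beta\gamma}^2 \\
ms_w & \sigma_e^2
\end{array} \tag{169}$$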


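From this it follows that estimates of these variance components may be
obtained as follows:

$$\begin{aligned}
\text{est'd } \sigma_e^2 &= ms_w \\
\text{est'd } \sigma_{\alpha\beta\gamma}^2 &= \frac{ms_{ABS} - ms_w}{n} \\
\text{est'd } \sigma_{\beta\gamma}^2 &= \frac{ms_{BS} - ms_{ABS}}{an} \\
\text{est'd } \sigma_{\alpha\gamma}^2 &= \frac{ms_{AS} - ms_{ABS}}{bn} \\
\text{est'd } \sigma_{\alpha\beta}^2 &= \frac{ms_{AB} - ms_{ABS}}{sn} \\
\text{est'd } \sigma_\gamma^2 &= \frac{ms_S - ms_{AS} - ms_{BS} + ms_{ABS}}{abn}.
\end{aligned} \tag{170}$$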
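The estimates in (170) can be sketched in the same way as those for the two-dimensional design. The function name, the dictionary keys, and the mean squares (for a = 3, b = 2, s = 50, n = 2) are hypothetical values chosen so that the arithmetic is easy to follow.

```python
def three_way_components(ms, a, b, s, n):
    """Variance-component estimates from the mean squares, following (170).

    `ms` maps the labels 'S', 'AB', 'AS', 'BS', 'ABS', 'w' to mean squares;
    the dictionary layout is an illustrative assumption.
    """
    return {
        "e": ms["w"],
        "alpha_beta_gamma": (ms["ABS"] - ms["w"]) / n,
        "beta_gamma": (ms["BS"] - ms["ABS"]) / (a * n),
        "alpha_gamma": (ms["AS"] - ms["ABS"]) / (b * n),
        "alpha_beta": (ms["AB"] - ms["ABS"]) / (s * n),
        "gamma": (ms["S"] - ms["AS"] - ms["BS"] + ms["ABS"]) / (a * b * n),
    }


# Hypothetical mean squares
ms = {"S": 70.0, "AB": 114.0, "AS": 22.0, "BS": 26.0, "ABS": 14.0, "w": 10.0}
components = three_way_components(ms, a=3, b=2, s=50, n=2)
print(components)
```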




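The estimated sampling variances of these variance estimates are given by

$$\begin{aligned}
\text{est'd } \sigma^2_{\text{est'd } \sigma_e^2} &= \frac{2\,ms_w^2}{df_w} \\
\text{est'd } \sigma^2_{\text{est'd } \sigma_{\alpha\beta\gamma}^2} &= \frac{2}{n^2}\left(\frac{ms_{ABS}^2}{df_{ABS}} + \frac{ms_w^2}{df_w}\right) \\
\text{est'd } \sigma^2_{\text{est'd } \sigma_{\beta\gamma}^2} &= \frac{2}{a^2n^2}\left(\frac{ms_{BS}^2}{df_{BS}} + \frac{ms_{ABS}^2}{df_{ABS}}\right) \\
\text{est'd } \sigma^2_{\text{est'd } \sigma_{\alpha\gamma}^2} &= \frac{2}{b^2n^2}\left(\frac{ms_{AS}^2}{df_{AS}} + \frac{ms_{ABS}^2}{df_{ABS}}\right) \\
\text{est'd } \sigma^2_{\text{est'd } \sigma_{\alpha\beta}^2} &= \frac{2}{s^2n^2}\left(\frac{ms_{AB}^2}{df_{AB}} + \frac{ms_{ABS}^2}{df_{ABS}}\right) \\
\text{est'd } \sigma^2_{\text{est'd } \sigma_\gamma^2} &= \frac{2}{a^2b^2n^2}\left(\frac{ms_S^2}{df_S} + \frac{ms_{AS}^2}{df_{AS}} + \frac{ms_{BS}^2}{df_{BS}} + \frac{ms_{ABS}^2}{df_{ABS}}\right).
\end{aligned} \tag{171}$$

For large values of the numbers of degrees of freedom involved, the sampling distributions of these variance estimates are approximately normal, so
that approximate confidence limits may readily be established for them.
This method of estimating confidence limits should not be used when any
number of degrees of freedom involved is small. Since in most applications
$df_A$, $df_B$, and $df_{AB}$ will be small, confidence limits for $\sigma_\alpha^2$, $\sigma_\beta^2$, and $\sigma_{\alpha\beta}^2$ may not
usually be established in this fashion. Since we are rarely interested in these
components, however, this is not of much consequence.

To attach general meanings to these components, let us first write the
identity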


Yi a B 
(Xii — B) = (ui. — u) + (u.i. — H) + Qu. — B) 
ap By 
+ (Maik — Hi. — Mek T M) + Mie Hi. — Hoa B) (172) 
ay apy 
+ (wij. — Bi. — Beg H B) F (Mise Bii — Bin — Bst Bis Ba. 
+ u.a H) 
€ 


+ (Xie na). 


In this expression, $\alpha = (\mu_{.j.} - \mu)$ represents an “A effect,” which may
be regarded as the general bias which is associated with a given A-category.
The term $\beta$ similarly represents the general bias which is associated with a
given B-category. The term $\gamma$ represents the deviation of the subject’s true
score from the population mean, and its variance, $\sigma_\gamma^2$, is the trait variance for
the population. The term $\alpha\beta$ represents the bias which is specific to a given
AB combination. Its variance, $\sigma_{\alpha\beta}^2$, is the variance of the true AB interaction
effects in the population. The term $\alpha\gamma$ represents the bias which
is specific to a given A-category and a given subject. Its variance, $\sigma_{\alpha\gamma}^2$, is
the variance of the true AS interaction effects in the population. The term
$\beta\gamma$ is similarly defined, and its variance is the variance of the true BS interaction
effects in the population. The term $\alpha\beta\gamma$ represents the bias which is
specific to a given AB combination and a given subject, and its variance,
$\sigma_{\alpha\beta\gamma}^2$, is the variance of the true ABS interaction effects in the population.
The last term, $e$, is the deviation of a single obtained score from the mean of
an infinite number of such scores for a given subject in a given AB-category,
and its variance, $\sigma_e^2$, may be called the basic error variance. All of these
effects are independent of one another, so that the variance of all obtained
measures for all subjects in the population, including an infinite number of
A-categories and an infinite number of B-categories, is given by

$$
\sigma_X^2 = \sigma_\alpha^2 + \sigma_\beta^2 + \sigma_\gamma^2 + \sigma_{\alpha\beta}^2 + \sigma_{\alpha\gamma}^2 + \sigma_{\beta\gamma}^2 + \sigma_{\alpha\beta\gamma}^2 + \sigma_e^2. \tag{173}
$$

From the estimates of these variance components, (170), we can infer a
number of correlation coefficients of considerable general significance. One
of these is the correlation, for all subjects in the population, of the mean of
n' measures in each of a' categories of A within each of b' categories of B,
and the mean of n' independent measures in each of the same A and B categories.
We will let

$$
M_{iI} = \frac{1}{a'b'}\sum_{j=1}^{a'}\sum_{k=1}^{b'} M_{ijk}
$$

represent one of these means (that for subject i), $M_{ijk}$ being the mean of
the n' measures for subject i in a given AB sub-category, in which case

$$
\mu_{iI} = \frac{1}{a'b'}\sum_{j=1}^{a'}\sum_{k=1}^{b'} \mu_{ijk}
$$

is the corresponding true score for subject i, and

$$
\mu_{.I} = \frac{1}{a'b'}\sum_{j=1}^{a'}\sum_{k=1}^{b'} \mu_{.jk}
$$

is the population mean of these true scores.
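We may now write the identity

$$
\begin{aligned}
\overbrace{\frac{1}{a'b'}\sum_{j=1}^{a'}\sum_{k=1}^{b'}(M_{ijk} - \mu_{.jk})}^{①}
={}& \overbrace{\frac{1}{a'b'}\sum_{j}\sum_{k}(\mu_{i..} - \mu)}^{②} \\
&+ \overbrace{\frac{1}{a'b'}\sum_{j}\sum_{k}(\mu_{ijk} - \mu_{ij.} - \mu_{i.k} - \mu_{.jk} + \mu_{i..} + \mu_{.j.} + \mu_{..k} - \mu)}^{③} \\
&+ \overbrace{\frac{1}{a'b'}\sum_{j}\sum_{k}(\mu_{ij.} - \mu_{i..} - \mu_{.j.} + \mu)}^{④} \\
&+ \overbrace{\frac{1}{a'b'}\sum_{j}\sum_{k}(\mu_{i.k} - \mu_{i..} - \mu_{..k} + \mu)}^{⑤} \\
&+ \overbrace{\frac{1}{a'b'}\sum_{j}\sum_{k}(M_{ijk} - \mu_{ijk})}^{⑥}.
\end{aligned} \tag{174}
$$
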

For convenience, we will refer to the terms in this identity by the circled
numbers above them. We may note then that

$$
④ = \frac{1}{a'b'}\sum_{j=1}^{a'}\sum_{k=1}^{b'}(\mu_{ij.} - \mu_{i..} - \mu_{.j.} + \mu)
= \frac{1}{a'}\sum_{j=1}^{a'}(\mu_{ij.} - \mu_{i..} - \mu_{.j.} + \mu),
$$

since the quantity in parentheses is the same for all values of k. We may
note likewise that

$$
⑤ = \frac{1}{b'}\sum_{k=1}^{b'}(\mu_{i.k} - \mu_{i..} - \mu_{..k} + \mu).
$$

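Now, since the terms on the right of (174) are independent of one another,
the variance of the left-hand term is equal to the sum of the variances of the
right-hand terms. That is,

$$
\sigma_{①}^2 = \sigma_{②}^2 + \sigma_{③}^2 + \sigma_{④}^2 + \sigma_{⑤}^2 + \sigma_{⑥}^2.
$$

But

$$
\sigma_{⑥}^2 = \frac{1}{a'^2b'^2}\sum_{j}\sum_{k}\frac{\sigma_e^2}{n'} = \frac{\sigma_e^2}{a'b'n'}.
$$

Also,

$$
\sigma_{③}^2 = \frac{1}{a'^2b'^2}\sum_{j}\sum_{k}\sigma_{\alpha\beta\gamma}^2 = \frac{\sigma_{\alpha\beta\gamma}^2}{a'b'}, \qquad
\sigma_{④}^2 = \frac{\sigma_{\alpha\gamma}^2}{a'}, \qquad
\sigma_{⑤}^2 = \frac{\sigma_{\beta\gamma}^2}{b'}, \qquad
\sigma_{②}^2 = \sigma_\gamma^2.
$$

Accordingly,

$$
\sigma_{M_{iI}}^2 = \frac{\sigma_e^2}{a'b'n'} + \frac{\sigma_{\alpha\beta\gamma}^2}{a'b'} + \frac{\sigma_{\alpha\gamma}^2}{a'} + \frac{\sigma_{\beta\gamma}^2}{b'} + \sigma_\gamma^2.
$$

The variance of the corresponding true scores is

$$
\sigma_{\mu_{iI}}^2 = \sigma_{M_{iI}}^2 \;[\text{when } n' \to \infty]
= \frac{\sigma_{\alpha\beta\gamma}^2}{a'b'} + \frac{\sigma_{\alpha\gamma}^2}{a'} + \frac{\sigma_{\beta\gamma}^2}{b'} + \sigma_\gamma^2,
$$

from which it follows that the correlation between the two means is

$$
r_{M_{iI}M_{iII}} = \frac{\dfrac{\sigma_{\alpha\beta\gamma}^2}{a'b'} + \dfrac{\sigma_{\alpha\gamma}^2}{a'} + \dfrac{\sigma_{\beta\gamma}^2}{b'} + \sigma_\gamma^2}{\dfrac{\sigma_e^2}{a'b'n'} + \dfrac{\sigma_{\alpha\beta\gamma}^2}{a'b'} + \dfrac{\sigma_{\alpha\gamma}^2}{a'} + \dfrac{\sigma_{\beta\gamma}^2}{b'} + \sigma_\gamma^2}. \tag{175}
$$
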

More specific correlation coefficients may then be secured by letting a',
b', and n' take various combinations of values in (175). For example, if we
wish the average correlation (for all B-categories) of the mean of n' measures
in each of a' categories of A within a single B-category, with the mean of n'
independent observations in each of the same A-categories within the same
B-category, we let b' = 1 in (175). If we wish the correlation of the mean of
n' observations in a given AB sub-category with the mean of n' independent
measures in the same AB sub-category, we let a' = 1 and b' = 1 in (175). If
we let a' = 1, b' = 1, and n' = 1, we have the correlation of a single measure
in a given AB sub-category with a second measure in this same sub-category.
To estimate any of these coefficients, we substitute the values obtained from
(170) in (175) and give a', b', and n' their appropriate values. Again, in the
case when a' = a, b' = b, and n' = n, the expression (175) reduces to one
which can be written in terms of an F, so that a confidence interval may be
established by the method of pages 361-362.
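As a hedged illustration of how (175) is evaluated for chosen values of a', b', and n', the following Python sketch may help. The function name, the dictionary keys, and the component values are assumptions for the example only; in practice the components would be the estimates obtained from (170).

```python
def reliability_same_categories(vc, a_p, b_p, n_p):
    """Correlation (175): same A- and B-categories, independent measures.

    vc holds variance components under the (hypothetical) keys
    'e', 'abg', 'ag', 'bg', 'g'; a_p, b_p, n_p stand for a', b', n'.
    """
    true_var = (vc["abg"] / (a_p * b_p) + vc["ag"] / a_p
                + vc["bg"] / b_p + vc["g"])
    error_var = vc["e"] / (a_p * b_p * n_p)
    return true_var / (error_var + true_var)


# Made-up components, for illustration only.
vc = {"e": 4.0, "abg": 2.0, "ag": 1.5, "bg": 1.0, "g": 3.0}
r_single = reliability_same_categories(vc, 1, 1, 1)   # one measure, one AB sub-category
r_pooled = reliability_same_categories(vc, 3, 4, 2)   # mean over 3 x 4 sub-categories, 2 measures each
```

With these made-up values the coefficient for a single measure is 7.5/11.5, and pooling over more categories and measures raises it, since only the basic error term is treated as error in (175).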

Another coefficient of general significance is that between the mean of
n' measures in each of a' categories of A within each of b' categories of B,
and the mean of n' independent measures in each of a' independent categories
of A, within each of the same categories of B. We will represent one of these
means by

$$
M_{iI} = \frac{1}{a'b'}\sum_{j=1}^{a'}\sum_{k=1}^{b'} M_{ijk}
$$

and a second such mean by

$$
M_{iII} = \frac{1}{a'b'}\sum_{j=a'+1}^{2a'}\sum_{k=1}^{b'} M_{ijk}.
$$

The corresponding true score for subject i is

$$
\mu_{iI} = \frac{1}{a'b'}\sum_{j=1}^{a'}\sum_{k=1}^{b'} \mu_{ijk} \;[\text{when } a' \to \infty]
= \frac{1}{b'}\sum_{k=1}^{b'} \mu_{i.k},
$$

and the population mean of the true scores is

$$
\mu_{.I} = \frac{1}{b'}\sum_{k=1}^{b'} \mu_{..k}.
$$

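We may now write the identity

$$
\begin{aligned}
\overbrace{\frac{1}{a'b'}\sum_{j=1}^{a'}\sum_{k=1}^{b'}(M_{ijk} - \mu_{..k})}^{①}
={}& \overbrace{\frac{1}{a'b'}\sum_{j}\sum_{k}(\mu_{i..} - \mu)}^{②} \\
&+ \overbrace{\frac{1}{a'b'}\sum_{j}\sum_{k}(\mu_{ijk} - \mu_{ij.} - \mu_{i.k} - \mu_{.jk} + \mu_{i..} + \mu_{.j.} + \mu_{..k} - \mu)}^{③} \\
&+ \overbrace{\frac{1}{a'b'}\sum_{j}\sum_{k}(\mu_{ij.} - \mu_{i..} - \mu_{.j.} + \mu)}^{④} \\
&+ \overbrace{\frac{1}{a'b'}\sum_{j}\sum_{k}(\mu_{i.k} - \mu_{i..} - \mu_{..k} + \mu)}^{⑤} \\
&+ \overbrace{\frac{1}{a'b'}\sum_{j}\sum_{k}(\mu_{.jk} - \mu_{.j.} - \mu_{..k} + \mu)}^{⑥} \\
&+ \overbrace{\frac{1}{a'b'}\sum_{j}\sum_{k}(\mu_{.j.} - \mu)}^{⑦} \\
&+ \overbrace{\frac{1}{a'b'}\sum_{j}\sum_{k}(M_{ijk} - \mu_{ijk})}^{⑧}.
\end{aligned} \tag{176}
$$
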
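We may next note that

$$
\sigma_{⑧}^2 = \frac{\sigma_e^2}{a'b'n'}, \qquad
\sigma_{③}^2 = \frac{\sigma_{\alpha\beta\gamma}^2}{a'b'}, \qquad
\sigma_{④}^2 = \frac{\sigma_{\alpha\gamma}^2}{a'}, \qquad
\sigma_{⑤}^2 = \frac{\sigma_{\beta\gamma}^2}{b'},
$$

$$
\sigma_{②}^2 = \sigma_\gamma^2, \qquad \sigma_{⑥}^2 = 0, \qquad \text{and} \qquad \sigma_{⑦}^2 = 0,
$$

terms ⑥ and ⑦ being the same for all subjects.
Accordingly,

$$
\sigma_{M_{iI}}^2 = \frac{\sigma_e^2}{a'b'n'} + \frac{\sigma_{\alpha\beta\gamma}^2}{a'b'} + \frac{\sigma_{\alpha\gamma}^2}{a'} + \frac{\sigma_{\beta\gamma}^2}{b'} + \sigma_\gamma^2.
$$

The variance of the corresponding true scores is

$$
\sigma_{\mu_{iI}}^2 = \sigma_{M_{iI}}^2 \;[\text{when } n' \to \infty \text{ and } a' \to \infty]
= \frac{\sigma_{\beta\gamma}^2}{b'} + \sigma_\gamma^2.
$$

The desired coefficient is then given by

$$
r_{M_{iI}M_{iII}} = \frac{\dfrac{\sigma_{\beta\gamma}^2}{b'} + \sigma_\gamma^2}{\dfrac{\sigma_e^2}{a'b'n'} + \dfrac{\sigma_{\alpha\beta\gamma}^2}{a'b'} + \dfrac{\sigma_{\alpha\gamma}^2}{a'} + \dfrac{\sigma_{\beta\gamma}^2}{b'} + \sigma_\gamma^2}. \tag{177}
$$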




Again we may secure more specific correlation coefficients by letting a',
b', and n' take various values. For example, if we wish the average correlation
(for all B-categories) of the mean of n' measures in each of a' categories of A
within a single B-category, and the mean of n' independent measures in each
of a' independent categories of A within the same B-category, we let b' = 1 in
(177). If we wish the average correlation (for all AB sub-categories) of the
mean of n' measures in one AB sub-category within a given B-category with
the mean of n' measures in another AB sub-category within the same
B-category, we let a' = 1 and b' = 1 in (177).

A correlation of particular significance is obtained by letting a' = 1 and
n' = 1 in (177). The result is the correlation between the mean of a single
measure within each of b' categories of B within a single A-category, and the
mean of a single measure within each of the same B-categories but in a different
A-category. If B represents “items” in a test and A represents “equivalent
forms” of the test, this is the equivalent forms reliability of the mean
score on the test.

$$
r_{M_{iI}M_{iII}} = \frac{\dfrac{\sigma_{\beta\gamma}^2}{b'} + \sigma_\gamma^2}{\dfrac{\sigma_e^2}{b'} + \dfrac{\sigma_{\alpha\beta\gamma}^2}{b'} + \sigma_{\alpha\gamma}^2 + \dfrac{\sigma_{\beta\gamma}^2}{b'} + \sigma_\gamma^2}. \tag{178}
$$

By reversing the roles of A and B in (177) we can obviously secure a similar
expression for the corresponding coefficient in which the B-categories, rather
than the A-categories, are independent.

Again we can estimate any of these correlation coefficients by substituting
from (170) in (177) and letting a', b', and n' take the appropriate values.
As in former instances, if we do this and let a' = a, b' = b, and n' = n in
(177), the result may be expressed in terms of an F, so that its confidence
interval may be established.
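The following Python sketch shows one way (177) and its special case (178) might be evaluated. The function names, dictionary keys, and component values are hypothetical and serve only to illustrate the arithmetic.

```python
def reliability_independent_a(vc, a_p, b_p, n_p):
    """Correlation (177): same B-categories, independent A-categories.

    vc holds variance components under the (hypothetical) keys
    'e', 'abg', 'ag', 'bg', 'g'; a_p, b_p, n_p stand for a', b', n'.
    """
    numerator = vc["bg"] / b_p + vc["g"]
    denominator = (vc["e"] / (a_p * b_p * n_p) + vc["abg"] / (a_p * b_p)
                   + vc["ag"] / a_p + vc["bg"] / b_p + vc["g"])
    return numerator / denominator


def equivalent_forms_reliability(vc, b_p):
    """Special case (178): a' = 1 and n' = 1 in (177)."""
    return reliability_independent_a(vc, 1, b_p, 1)


# Made-up components, for illustration only.
vc = {"e": 4.0, "abg": 2.0, "ag": 1.5, "bg": 1.0, "g": 3.0}
r_one_item = equivalent_forms_reliability(vc, 1)
r_four_items = equivalent_forms_reliability(vc, 4)
```

In the test illustration, `b_p` plays the role of the number of items per form, so the second call corresponds to a four-item form under these assumed components.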

Another coefficient of considerable general significance is that between
the mean of n' measures in each of a' categories of A within each of b' categories
of B, and the mean of n' independent measures in each of a' independent
categories of A, within each of b' independent categories of B. We will represent
one of these means by

$$
M_{iI} = \frac{1}{a'b'}\sum_{j=1}^{a'}\sum_{k=1}^{b'} M_{ijk}
$$

and a second similar mean by

$$
M_{iII} = \frac{1}{a'b'}\sum_{j=a'+1}^{2a'}\sum_{k=b'+1}^{2b'} M_{ijk}.
$$

The corresponding true score for subject i is

$$
\mu_{iI} = \frac{1}{a'b'}\sum_{j=1}^{a'}\sum_{k=1}^{b'} \mu_{ijk} \;[\text{when } a' \to \infty \text{ and } b' \to \infty]
= \mu_{i..},
$$

and the population mean of these true scores is $\mu$.




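We may now write the identity

$$
\begin{aligned}
\overbrace{\frac{1}{a'b'}\sum_{j=1}^{a'}\sum_{k=1}^{b'}(M_{ijk} - \mu)}^{①}
={}& \overbrace{\frac{1}{a'b'}\sum_{j}\sum_{k}(\mu_{i..} - \mu)}^{②} \\
&+ \overbrace{\frac{1}{a'b'}\sum_{j}\sum_{k}(\mu_{ijk} - \mu_{ij.} - \mu_{i.k} - \mu_{.jk} + \mu_{i..} + \mu_{.j.} + \mu_{..k} - \mu)}^{③} \\
&+ \overbrace{\frac{1}{a'b'}\sum_{j}\sum_{k}(\mu_{ij.} - \mu_{i..} - \mu_{.j.} + \mu)}^{④} \\
&+ \overbrace{\frac{1}{a'b'}\sum_{j}\sum_{k}(\mu_{i.k} - \mu_{i..} - \mu_{..k} + \mu)}^{⑤} \\
&+ \overbrace{\frac{1}{a'b'}\sum_{j}\sum_{k}(\mu_{.jk} - \mu_{.j.} - \mu_{..k} + \mu)}^{⑥} \\
&+ \overbrace{\frac{1}{a'b'}\sum_{j}\sum_{k}(\mu_{.j.} - \mu)}^{⑦} \\
&+ \overbrace{\frac{1}{a'b'}\sum_{j}\sum_{k}(\mu_{..k} - \mu)}^{⑧} \\
&+ \overbrace{\frac{1}{a'b'}\sum_{j}\sum_{k}(M_{ijk} - \mu_{ijk})}^{⑨}.
\end{aligned} \tag{179}
$$
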
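We next note that

$$
\sigma_{⑨}^2 = \frac{\sigma_e^2}{a'b'n'}, \qquad
\sigma_{③}^2 = \frac{\sigma_{\alpha\beta\gamma}^2}{a'b'}, \qquad
\sigma_{④}^2 = \frac{\sigma_{\alpha\gamma}^2}{a'}, \qquad
\sigma_{⑤}^2 = \frac{\sigma_{\beta\gamma}^2}{b'},
$$

$$
\sigma_{②}^2 = \sigma_\gamma^2, \qquad \sigma_{⑥}^2 = 0, \qquad \sigma_{⑦}^2 = 0, \qquad \text{and} \qquad \sigma_{⑧}^2 = 0.
$$

Accordingly,

$$
\sigma_{M_{iI}}^2 = \frac{\sigma_e^2}{a'b'n'} + \frac{\sigma_{\alpha\beta\gamma}^2}{a'b'} + \frac{\sigma_{\alpha\gamma}^2}{a'} + \frac{\sigma_{\beta\gamma}^2}{b'} + \sigma_\gamma^2.
$$

The variance of the corresponding true scores is

$$
\sigma_{\mu_{iI}}^2 = \sigma_{M_{iI}}^2 \;[\text{when } n' \to \infty,\ a' \to \infty, \text{ and } b' \to \infty]
= \sigma_\gamma^2.
$$

The desired reliability coefficient is then given by

$$
r_{M_{iI}M_{iII}} = \frac{\sigma_\gamma^2}{\dfrac{\sigma_e^2}{a'b'n'} + \dfrac{\sigma_{\alpha\beta\gamma}^2}{a'b'} + \dfrac{\sigma_{\alpha\gamma}^2}{a'} + \dfrac{\sigma_{\beta\gamma}^2}{b'} + \sigma_\gamma^2}. \tag{180}
$$
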


Again we may secure more specific reliability coefficients by letting a', b',
and n' take various values in (180). For example, if we let a' = 1, b' = 1,
and n' = 1, we have the correlation between a single measure in a single
A-category within a single B-category with a single measure from a different
A-category and a different B-category. This coefficient, which may be denoted
by $r_{XX}$, best deserves being described as “the” reliability coefficient
of $X$ for a specified population of subjects, since it takes into consideration
all important sources of error identified in the analysis so far as the selected
A and B categories are concerned.
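A brief Python sketch of (180) follows, including the single-measure case a' = b' = n' = 1 just described. The function name, the dictionary keys, and the component values are assumptions introduced for the example.

```python
def reliability_independent_ab(vc, a_p, b_p, n_p):
    """Correlation (180): independent A-categories and independent B-categories.

    vc holds variance components under the (hypothetical) keys
    'e', 'abg', 'ag', 'bg', 'g'; a_p, b_p, n_p stand for a', b', n'.
    """
    denominator = (vc["e"] / (a_p * b_p * n_p) + vc["abg"] / (a_p * b_p)
                   + vc["ag"] / a_p + vc["bg"] / b_p + vc["g"])
    return vc["g"] / denominator


# Made-up components, for illustration only.
vc = {"e": 4.0, "abg": 2.0, "ag": 1.5, "bg": 1.0, "g": 3.0}
r_xx = reliability_independent_ab(vc, 1, 1, 1)        # "the" reliability of a single measure
r_schedule = reliability_independent_ab(vc, 3, 4, 2)  # a candidate measurement schedule
```

Comparing values such as `r_schedule` for several choices of a', b', and n' is one way the estimated components might be used when planning a measurement schedule.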

We may in this situation distinguish several types of “true” scores. The
mean of an infinite number of observations of a given subject in each of an
infinite number of B-categories within the same A-category (Category 1),
which has been denoted $\mu_{i1.}$, may be called an “A-true score.” Similarly,
$\mu_{i.1}$ represents a “B-true score,” and $\mu_{i11}$ represents an “AB-true score.”
The mean of an infinite number of AB-true scores for an infinite number
both of A-categories and B-categories is “the” true score, and has been
denoted by $\mu_{i..}$.

The correlations between different A-true scores or different B-true scores, 
or between different AB-true scores in the same A-category, or different 
AB-true scores in the same B-category, or between AB-true scores from both 
different A and different B categories may all have considerable practical 
significance in planning a test or a measurement schedule. These may all 
be estimated from the estimated variance components. 

The “reliability” of an A-true score, that is, the correlation between
different A-true scores for the entire population of subjects, is obtained by
letting a' = 1, b' = ∞, and n' = ∞ in (177), which then reduces to

$$
r_{\mu_{i1.}\mu_{i2.}} = \frac{\sigma_\gamma^2}{\sigma_{\alpha\gamma}^2 + \sigma_\gamma^2}.
$$

Similarly, the correlation between AB-true scores from different A and B
categories is obtained by letting a' = 1, b' = 1, and n' = ∞ in (180),
which reduces to

$$
r_{\mu_{i11}\mu_{i22}} = \frac{\sigma_\gamma^2}{\sigma_{\alpha\beta\gamma}^2 + \sigma_{\alpha\gamma}^2 + \sigma_{\beta\gamma}^2 + \sigma_\gamma^2}.
$$

Expressions for $r_{\mu_{i.1}\mu_{i.2}}$, $r_{\mu_{i11}\mu_{i12}}$, and $r_{\mu_{i11}\mu_{i21}}$ may be similarly obtained from
(175), (177), and (178), in the last two cases by interchanging A and B.

Estimates of any of the preceding correlation coefficients may be obtained
by substituting for the variance components in the right-hand side of the
expression (180) the estimates of these components obtained from the sample
(170). It is left to the student to determine the simplest expression for each
in terms of the mean squares, and to show how these expressions simplify
when n' = n or a' = a or b' = b, etc.

Possible applications of this design are suggested in the following
paragraphs:


Each of the s subjects performs a task n times for each of a observers
on each of b days, the observers and days being the same for each subject,
and regarded as randomly selected from populations of observers and days.


Each of s workers produces a certain product on each of a machines
on each of b days, the machines and days being randomly selected but
the same for all workers.


Each of s pilots performs a certain maneuver a times on each of b (random)
airplane rides, being accompanied by a different observer on each
ride, the same (random) observers being used for all pilots.


Each of s subjects responds to each of a items on each of b equivalent
forms of a performance test. The items are not regarded as randomly
selected, but the forms are a random sample from a population of forms.


Each of s randomly selected subjects responds n times to each of a
items in each of b parts of a performance test, neither A nor B being
regarded as random effects.


It is left as an exercise for the student to suggest further applications 
and to discuss the meanings of the various variance components and of the 
possible reliability coefficients in each illustrative situation, as well as to 
suggest how the information may be employed in planning measurement 
schedules. 


Groups (of Observations) Within Subjects 


In some reliability studies, the design employed will be similar to the 
groups-within-treatments design discussed in Chapter 7. Suppose, for 
example, that each of s subjects performs a task n times independently on 
each of a occasions, the sample of “occasions” for each subject being regarded 
as independent of that for any other subject. The analysis of the total sum
of squares will then be into between-subjects, between-occasions-within-subjects,
and within-occasions. In more general terms, the analysis of the total
sum of squares will be into between-subjects, between-groups-within-subjects,
and within-groups.

We will let $X_{ijk}$ represent the kth observation (k = 1, 2, ..., n) in the
jth group for the ith subject, $M_{ij}$ represent the mean of the n observations
in the jth group (j = 1, 2, ..., a) for the ith subject, $M_i$ represent the mean
of the an observations for the ith subject (i = 1, 2, ..., s), and $M$ represent
the general mean of the ans observations. We will let $\mu_{ij}$, $\mu_i$, and $\mu$ represent
the corresponding true means, respectively. We may then write the identity

$$
(X_{ijk} - \mu) = (\mu_i - \mu) + (\mu_{ij} - \mu_i) + (X_{ijk} - \mu_{ij}),
$$

from which it follows that

$$
\sigma_X^2 = \sigma_{\mu_i}^2 + \sigma_{(\mu_{ij} - \mu_i)}^2 + \sigma_{(X_{ijk} - \mu_{ij})}^2
= \sigma_\gamma^2 + \sigma_\alpha^2 + \sigma_e^2.
$$

The expected values of the mean squares obtained from the sample are

$$
\begin{aligned}
E(ms_S) &= \sigma_e^2 + n\sigma_\alpha^2 + an\sigma_\gamma^2 \\
E(ms_{GwS}) &= \sigma_e^2 + n\sigma_\alpha^2 \\
E(ms_w) &= \sigma_e^2.
\end{aligned}
$$
From these it follows that

$$
\begin{aligned}
\text{est'd } \sigma_e^2 &= ms_w \\
\text{est'd } \sigma_\alpha^2 &= \frac{ms_{GwS} - ms_w}{n} \\
\text{est'd } \sigma_\gamma^2 &= \frac{ms_S - ms_{GwS}}{an}.
\end{aligned}
$$
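The three estimates for the groups-within-subjects design can be sketched in Python as follows. The function and argument names and the numerical values are hypothetical, chosen only to make the arithmetic concrete.

```python
def estimate_nested_components(ms_s, ms_gws, ms_w, a, n):
    """Estimate the components for the groups-within-subjects design.

    ms_s, ms_gws, ms_w: mean squares for subjects, groups within subjects,
    and within groups (argument names are assumptions for this sketch).
    a: groups per subject; n: observations per group.
    """
    return {
        "e": ms_w,                         # error variance
        "a": (ms_gws - ms_w) / n,          # groups-within-subjects variance
        "g": (ms_s - ms_gws) / (a * n),    # variance of subject true scores
    }


# Made-up mean squares for a = 4 groups of n = 5 observations each.
nested = estimate_nested_components(ms_s=67.0, ms_gws=7.0, ms_w=2.0, a=4, n=5)
```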


From the preceding, it is obvious that the variance of the mean (M) of n'
observations in each of a' groups is

$$
\sigma_M^2 = \frac{\sigma_e^2}{a'n'} + \frac{\sigma_\alpha^2}{a'} + \sigma_\gamma^2,
$$

and that the reliability of such a mean score is given by

$$
r = \frac{a'n'\sigma_\gamma^2}{\sigma_e^2 + n'\sigma_\alpha^2 + a'n'\sigma_\gamma^2}. \tag{181}
$$

From (181) various reliability coefficients may be obtained by substituting
various combinations of values for a' and n'.
It is again left as an exercise for the student to suggest various applications
of this design and to discuss the meanings of the reliability coefficients that
may be obtained with each, as well as to suggest how the information may be
employed in the planning of a measurement schedule.
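As one hedged illustration of how (181) might be used in planning, the Python sketch below evaluates the coefficient for different numbers of groups and observations. The function name and the component values are assumptions for the example.

```python
def nested_reliability(var_e, var_a, var_g, a_p, n_p):
    """Reliability (181) of the mean of n' observations in each of a' groups.

    var_e, var_a, var_g: error, groups-within-subjects, and subject
    (true score) variance components; a_p and n_p stand for a' and n'.
    """
    return (a_p * n_p * var_g) / (var_e + n_p * var_a + a_p * n_p * var_g)


# Made-up components, for illustration only.
r_single_obs = nested_reliability(2.0, 1.0, 3.0, a_p=1, n_p=1)
r_more_obs = nested_reliability(2.0, 1.0, 3.0, a_p=1, n_p=5)
r_more_groups = nested_reliability(2.0, 1.0, 3.0, a_p=5, n_p=1)
```

Under these assumed components, five groups of one observation give a higher coefficient than one group of five observations, because added observations within a group reduce only the error term.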




Appendix 


TABLE OF RANDOM NUMBERS 


How to Use the Table of Random Numbers. The
basic operation in most uses of this table is the arrangement
of a number (N) of serially-numbered items
in random order. For example, if for a group of 120
subjects one wishes to assign 40 subjects at random
to each of three treatment groups, one would first
arrange the subjects in random order and then assign
the first 40 in this order to treatment 1, the second 40
to treatment 2, etc. The simplest procedure, particularly
when N is large, is to (1) enter the serial numbers
on cards or slips of paper, (2) make a “blind” selection
of a page of the table of random numbers and of an
(x + 2)-digit row or column on this page (x being the
number of digits in N), (3) enter the random numbers
in this row or column on the cards in the order in which
they appear in the table, going on to adjacent rows (or
columns) if necessary, (4) arrange the cards in order of
the random numbers. The cards will then be arranged
in random order, except for those consecutive cards
bearing the same random number. Since the random
numbers contain two digits more than N, there will be
very few such ties, but the cards within each “tied”
set may be arranged in random order by assigning new
random numbers to them and rearranging them in the
order of these new random numbers.
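The card procedure described above can be sketched in Python. This is only an illustration under a stated assumption: Python's pseudo-random generator stands in for the printed table, and the function names are hypothetical.

```python
import random


def random_order(items, digits, rng):
    """Arrange items in random order by sorting on random keys.

    Each item receives a random number of the given number of digits;
    items that tie on a key are given new random numbers and re-ordered
    among themselves, as in the card procedure.
    """
    keyed = {}
    for item in items:
        keyed.setdefault(rng.randrange(10 ** digits), []).append(item)
    ordered = []
    for key in sorted(keyed):
        tied = keyed[key]
        if len(tied) == 1:
            ordered.extend(tied)
        else:
            ordered.extend(random_order(tied, digits, rng))  # break ties with new keys
    return ordered


def assign_to_treatments(n_subjects, n_treatments, seed=None):
    """Randomly order subjects 1..N, then cut the order into equal groups."""
    if n_subjects % n_treatments != 0:
        raise ValueError("n_subjects must be divisible by n_treatments")
    rng = random.Random(seed)
    digits = len(str(n_subjects)) + 2  # two digits more than N, so ties are rare
    order = random_order(list(range(1, n_subjects + 1)), digits, rng)
    size = n_subjects // n_treatments
    return [order[t * size:(t + 1) * size] for t in range(n_treatments)]


groups = assign_to_treatments(120, 3, seed=1)
```

Here `groups[0]` holds the 40 subjects assigned to treatment 1, `groups[1]` those assigned to treatment 2, and so on, matching the 120-subject example in the text.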




Table of Random Numbers ¹


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20


1 03 47 43 73 86 36 96 47 36 61 46 98 63 71 62 33 26 16 80 45 
2 97 74 24 67 62 42 81 14 57 20 42 53 32 37 32 27 07 36 07 51 
3 16 76 62 21 66 56 50 26 71 07 32 90 79 78 53 13 55 38 58 59 
4 12 56 85 99 26 96 96 68 27 31 05 03 72 93 15 57 12 10 14 21 
5 55 59 56 35 64 38 54 82 46 21 31 62 43 90 90 06 18 44 32 53 
6 16 22 77 94 39 49 54 43 54 82 17 37 93 23 78 87 35 20 96 43 
7 84 42 17 53 31 37 24 55 06 88 77 04 74 47 67 21 76 33 50 25
8 63 01 63 78 59 16 95 55 57 19 98 10 50 71 75 12 86 73 58 07 
9 33 21 12 34 29 78 64 56 07 82 52 42 07 44 38 15 51 00 13 42
10 57 60 86 32 44 09 47 27 96 54 49 17 46 09 62 90 52 84 77 27
11 18 18 07 92 46 44 17 16 58 09 79 83 86 19 62 06 76 50 03 10
12 26 62 38 97 75 84 16 07 44 99 83 11 46 32 24 20 14 85 88 45 
13 23 42 40 64 74 82 97 77 77 81 07 45 32 14 08 32 98 94 07 72 
14 52 36 28 19 95 50 92 26 11 97 00 56 76 31 38 80 22 02 53 53 
15 31 85 94 35 12 83 39 50 08 30 42 34 07 96 88 54 42 06 87 98 
16 70 29 17 12 13 40 33 20 38 26 13 89 51 03 74 17 76 37 13 04 
17 56 62 18 37 35 96 83 50 87 75 97 12 25 93 47 70 33 24 03 54 
18 99 49 57 22 77 88 42 95 45 72 16 64 36 16 00 04 43 18 66 79 
19 16 08 15 04 72 33 27 14 34 09 45 59 34 68 49 12 72 07 34 45 
20 31 16 93 32 43 50 27 89 87 19 20 15 37 00 49 52 85 66 60 44 
21 68 34 30 13 7 55 74 30 77 40 44 22 78 84 26 04 33 36 09 52 
22 74 57 25 65 7 59 29 97 68 60 71 91 38 67 54 13 58 18 25 27 
23 27 42 37 86 53 48 55 90 65 72 96 57 69 36 10 96 46 92 42 45 
24 00 39 68 29 61 66 37 32 20 30 77 84 57 03 29 10 45 65 04 26 
25 29 94 98 94 24 68 49 69 10 82 53 75 91 93 30 234 25 20 57 27 
26 16 90 82 66 59 83 62 64 11 12 67 19 00 71 74 60 47 21 29 68 
27 11 27 94 75 06 06 09 19 74 66 02 94 37 34 02 76 70 90 30 86 
28 35 24 10 16 20 33 32 51 26 38 79 78 45 04 91 16 92 53 56 16 
29 38 23 16 86 38 42 38 97 01 50 87 75 66 81 41 40 01 74 91 62 
30 31 96 25 91 47 96 44 33 49 13 34 96 82 53 91 00 52 43 48 85 
31 66 67 40 67 14 64 05 71 95 86 11 05 65 09 68 76 83 20 37 90 
15 73 88 05 90 52 27 41 14 86 22 98 12 22 08 
33 68 05 51 18 00 33 96 02 74 19 07 60 62 93 55 59 33 82 43 90 
34 20 46 78 73 90 97 51 40 14 02 04 02 33 31 08 39 54 16 49 36 
35 64 19 58 97 79 15 06 15 93 20 01 90 10 75 06 40 78 78 89 62
36 05 26 93 70 60 22 35 85 15 13 92 03 51 59 77 59 56 78 06 83 
37 07 97 10 88 23 09 98 42 99 64 61 71 62 99 06 51 29 16 93 15 
38 68 71 86 85 85 54 87 66 47 54 73 32 98 11 12 44 95 92 63 16 
39 14 65 52 68 74 87 37 78 22 41 26 78 63 06 55 13 08 27 01 50
40 17 53 77 58 71 71 59 36 50 72 12 41 94 96 26 44 95 27 36 99 


41 90 26 59 21 19 23 41 61 33 12 96 93 02 18 39 07 02 18 36 07
89 20 


42 41 23 52 55 99 31 52 23 69 96 10 47 48 45 88 13 41 43 


43 26 99 61 65 53 58 04 49 80 70 42 10 50 67 42 32 17 55 85 74 


¹ Reprinted from Table 33 of R. A. Fisher and F. Yates, Statistical Tables for Biological,
Agricultural, and Medical Research, published by Oliver and Boyd Ltd., by permission of
the authors and publishers.


Table of Random Numbers (continued)

[The entries on the continuation pages of the table are not legible in this copy.]


Index 


A X B X C design; see Three-dimensional
designs
A X B X L design; see Three-dimensional
designs
A X B X R design; see Three-dimensional
designs
A X B X S design; see Three-dimensional
designs
A X L X R design; see Three-dimensional
designs
Analysis of covariance 
Assumptions of test of treatments 
effect, 323-325, 328-330 
Basic formulas, 319-325 
Comparison with treatments X levels 
design, 333-334 
Control of more than one variable, 335- 
336 
Generalized procedure, 332-333 
Illustrative example, 325-327 
Nature and purpose, 317-319 
Test of homogeneity of regression, 330-331
Test of significance of differences between
individual pairs of treatment
means, 327
Test of treatments effect, 323
Used to introduce an additional factor
into a factorial experiment, 334-335
Analysis of variance in double-entry ta- 
bles: analysis of total sum of 
squares, 110-114; case of one obser- 
vation per cell, 114; interaction, 
meaning of, 118-120; notation, 108- 
109; summary table, 115 
Analysis of variance in reliability studies 
Groups (of observations) within subjects:
estimates of variance components,
382; expected values of mean
squares, 382; reliability coefficients,
382; subject true scores, 382
Introduction, 357-359 
One-dimensional design: confidence interval
for reliability coefficient of an
obtained score, 361-362; estimated
error variance, 358-360; estimated
variance of true scores, 359-360;
reliability coefficient of an obtained
score, 361-362; reliability coefficient
of the mean of k scores for each
individual, 361; sampling distribution of
the estimated error variance, 362;
sampling distribution of the estimated
variance of true scores, 362
Two-dimensional design: error of observation,
363; estimates of variance
components, 364; expected values of
mean squares, 364; illustration, 370-371;
observer bias, 363; observer X
subject interaction, 363, 368; observer
true score, 362; reliability coefficients,
364-370; sampling variances
of variance estimates, 364; subject
true score, 362-363; trait variance,
363; variance of observer bias,
363; variance of observer X subject
interaction effects, 363
Three-dimensional design: A-factor 
bias, 373; applications, 381; B-factor 
bias, 373; estimates of variance com- 
ponents, 372; expected values of 
mean squares, 372; reliability coeffi- 
cients, 374-381; sampling variances 
of variance estimates, 373; subject 
true score, 376, 378, 380; trait vari- 
ance, 373; variances of true interac- 
tion effects, 374 


Bartlett test, 87-88 




Behrens-Fisher test, 97 
Bias, defined, 2 


Chi-square, defined, 27 

Chi-square distribution: characteristics 
of, 27-31; mean of, 28; mode of, 28; 
table, 29 

Cochran-Cox test, 97-98 

Computational formulas; see subheading 
Summary table, under specific design 
titles 

Confounding: in Latin square designs, 
261-264; in treatments X levels de- 
sign, 146-147; in treatments X sub- 
jects design, 163-164; partial, 304 

Counterbalanced factor, defined, 352 

Counterbalancing: defined, 162-163; in 
treatments X subjects design, 162- 
163; in type II design, 276-277; in 
type IV design, 288; in type V design, 
292 

“Counting off” method, 129-132 

Criterion factor, counterbalancing of: in 
type II design, 277, 281; in type IV 
design, 288; in type V design, 292 

Criterion factor, defined, 277 


Degrees of freedom: defined, 28; rules for 
determining, 35-36 
Direct effects of treatments, defined, 163 


Effect, defined, 1 

Efficiency, defined, 5 

Error of observation, 363 

Errors in testing hypotheses (type I and
type II): control of type I errors, 70-72;
consequences of, 68-70; effect of
precision on, 70; type I, defined, 66;
type II, defined, 66

Errors influencing treatment means: 
type G, 9-10; type R, 10-11; type S, 
9 

Expected value, defined, 58 

Experiment, defined, 100 

Experimental data, defined, 100 

Experimental designs: basic types, 7-8;
essential characteristics, 6-7; illustrations,
12-26




Experimental variable, defined, 1 


F-distribution: characteristics of, 37-44; 
table, 41-44 
F-ratio, defined, 39 
F-test: assumptions of, 72-86; see also 51, 
footnote 1 
Factor, defined, 207-208 
Factorial design (two-factor) 
Factor, defined, 207-208 
Generalized case, 207-208 
Illustration, 20-23 
Meaning of ms_A (or B)/ms_AB, 214-216
Summary table, 208
Test of AB interaction, 209-211
Test of differences between individual
treatment means, 214
Test of main effects of treatments, 211-213
Test of simple effects of treatments,
213-214
Transformations, 216 


Graeco-Latin square, defined, 264-265 
Groups-within-treatments design 
Analysis of variance, 174-176 
Expected values of mean squares, 182- 
185 
Generalized case, 172-174 
Groups considered as random samples, 
182 
Illustration, 23-25 
Limitations and advantages, 197 
Meaning of F = ms_GwA/ms_wG, 185-186
Precision of means and differences be- 
tween pairs of means, 186-187 
Summary table, 177 
Treatments effects: interpretation of 
test of, 178-181; test of, 178 


Higher-dimensional designs 
Analysis of total sum of squares, 254 
Computational procedures, 254-255 
Incomplete factorial designs, defined, 
257 
Interaction, interpretation of, 255-256 
Limitations, 256 
Notation, 256 
Homogeneity of variance, test for, 87-88; 
see also Norton study 




Incomplete factorial designs: defined, 257; 
see Chapters 12 and 13 

Independence, defined, 31 

Independence of mean and variance, 
proof of, 31-35 

Interaction: generalized meaning in dou- 
ble-entry tables, 118-120; in higher- 
dimensional designs, 255-256; in 
three-dimensional designs, 223, 228- 
230; see also subheadings under spe- 
cific design titles 

Interaction effect, 118 

Investigation, defined, 100 


Latin square, defined, 259 
Latin square and Graeco-Latin square 
designs 
Analysis in simple Latin square designs, 
260-261 
Confounding in Latin square designs, 
261-264 
Graeco-Latin squares, defined, 264- 
265 
Latin square, defined, 259 
Orthogonal Latin squares, table of, 265 
Uses of, 264 
Levels, constitution of: "counting off" 
method, 129-132; representative 
sampling method, 128-129 


Main effect, defined, 15, 122; see also sub- 

headings under specific design titles 
Mean square, defined, 54; ratio, 54-66 
“Mixed” designs 

Additional two- and three-factor de- 
signs, 301-304 

Defined, 267 

Higher-order designs, 305-306 

Partial confounding in, 304 

Summary of, 302 (table)

Type I design: analysis of total sum of 
squares, 268-269; diagram, 268; sum- 
mary table, 269; test of A effect, 270- 
271; test of AB interaction, 271; test 
of B effect, 270; test of simple effects 
of A, 271; test of simple effects of B, 
271-272; uses of, 273 

Type II design: analysis of total sum of
squares, 277-278; diagram, 273;
importance of, 276-277; meaning of AB
interaction, 274-276; summary table,
278; test of AB interaction, 279-280;
test of main effects, 279; test of simple
effects, 280-281; uses of, 281

Type III design: analysis of total sum 
of squares, 282-283; diagram, 282; 
summary table, 284; tests of signifi- 
cance, 283-284; uses of, 284 

Type IV design: analysis of total sum of 
squares, 286-287; diagram, 285; sum- 
mary table, 287; tests of significance, 
287-288; uses of, 288 

Type V design: analysis of total sum of 
squares, 290-291; diagram, 289; sum- 
mary table, 291; tests of significance, 
291-292; uses of, 292 

Type VI design: analysis of total sum of 
squares, 293-296; diagram, 292; 
“pseudo replications,” 293-296; sum- 
mary table, 296; tests of significance, 
296-297 

Type VII design: analysis of total sum 
of squares, 298-300; diagram, 297; 
summary table, 300; tests of signifi- 
cance, 300-301 

Monotonic relationship: defined, 99; test 

for, 99-100 


Norton study, 78-86 
Null hypothesis, defined, 1 


Objectivity coefficient, 369 

Observational data, defined, 100 

Observer bias, 363 

Observer X subject interaction, 363, 368 

Observer true score: defined, 362; reliabil- 
ity of, 368 

Order effects, counterbalancing of: in 
treatments X subjects design, 162- 
163; in type II design, 276-277; in 
type IV design, 288; in type V design, 
292 

Order effects, defined, 162-163 

Orthogonal Latin squares: defined, 265; 
table of, 265 


Partial confounding, defined, 304 




Precision: defined, 2; effect on type II 
errors, 70; importance of, 2-5 
Pseudo replications, 293-296 


Random numbers, table of, 387-389 
Random replications design 
Advantages and limitations, 200-201 
Confidence interval for treatment popu- 
lation mean, 198 
Generalized case, 190 
Illustration, 17-20 
Population consisting of finite groups, 
191-195 
Precautions in planning and adminis- 
tering experiment, 198-200 
Replication of simple-randomized de- 
sign, 195-197 
Replication of treatments X levels de- 
sign, 197 
Replication of treatments X subjects
design, 197
"Simple" replications, 197
Summary table, 191
Test of treatments effects, 191-195
Testing differences between individual
treatment means, 197-198
Use of ms_AR as an error term, 201-202
Randomization, principle of, 11-12 
Region of rejection, 65-66 
Reliability coefficients; see Analysis of 
variance in reliability studies 
Reliability of an obtained score, 361, 365 
Repeatable factor, defined, 352 


Sequence effects, counterbalancing of: in 
treatments X subjects design, 162- 
163; in type IV design, 288 

Sequence effects, defined, 162-163 

Simple effect, defined, 15, 122; see also
subheadings under specific design
titles

Simple-randomized design 

Applications, 98-101 

Categorical treatments, 98 
Computational procedures, 55-56 
Critical differences, 93-94 

Expected values of mean squares, 58-64 
Hypothesis tested, 47-48 

Illustration, 12-13 


Limitations of t-test, 48-49 
Single-factor treatments, 98 
Summary table, 56
Test of overall null hypothesis: mean
square ratio, 54-66; measure of discrepancy,
49-51; sampling distribution
of the measure of discrepancy,
51-54
Testing differences between individual
treatment means, 90-96
Transformations, 88-90 
Simple random replication, defined, 19, 
197 
Spearman-Brown formula, 361 


t-distribution: characteristics of, 37; table, 
38 
t-ratio: defined, 37; limitations of t-test,
48-49 
Testing hypotheses, 5-6 
Three-dimensional designs 
A X B X C design, 243
A X B X L design: test of AB in- 
teraction, 240-241; test of ABL 
interaction, 240; test of AL and BL 
interactions, 241-242; test of main 
effects of A and B, 242-243 
A X B X R design: simple effects, de- 
fined, 237; test of AB interaction, 
231-234; test of ABR interaction, 
230-231; test of AR and BR interac-
tions, 234-235; test of main effects of
A and B, 235-236; test of simple 
effects of A and B, 236-237 
A X B X S design: test of AB interac- 
tion, 237-238; test of main effects of 
A and B, 238 
A X L X R design, 238-239
Analysis of total sum of squares, 220- 
225 
Computational procedures, 225-228 
Summary table, 226 
Test of differences between individual 
pairs of means, 243-244 
Triple interaction: defined, 223; mean- 
ing of, 228-230 
Trait variance, 363, 373 
Transformations: in factorial (two-factor) 
designs, 216; in simple-randomized 




design, 88-90; in treatments X levels
design, 149-151 
Treatment: categorical, 98; defined, 1; 
single-factor, 99
Treatment effect, defined, 122
Treatment population, defined, 48
Treatments effect, defined, 122
Treatments X levels design 
Confounding, 146-147 
Constituting levels: "counting off"
method, 129-132; representative 
sampling method, 128-129 
Generalized case, 121-123 
Illustration, 13-16 
Interaction: heterogeneity of, 126; in- 
teraction effect, 126; interpretation 
of test of, 139-141; intrinsic versus 
extrinsic, 124, 140; meaning of, 16, 
123-124; multiple-treatment designs, 
125-126; test of, 138-139; two- 
treatment designs, 15-16, 119, 125 
Limitations and advantages, 147-148 
Main effect of treatments: defined, 15, 
122; interpretation of test of, 135- 
137; test of, 133-135 
Meaning of ms_A/ms_AL, 141-144
Missing cases, 148 
One observation per cell, 145-146 
Purposes of, 122 
Simple effect of treatments, defined, 15, 
122 
Summary table, 123 
Tests of significance of individual differ- 
ences, 146 
Transformations, 149-151 
Treatment effect, defined, 122 
Treatments effect, defined, 122 
Treatments X subjects design 
Confidence interval for treatment 
means, 166-167 
Confounding, 163-164 
Counterbalancing, 162-163 
Critical difference, 166-167 
Generalized case, 156 
Illustration, 17 
Limitations and advantages, 160-163 
Order effects, defined, 162-163 


Sequence effects, defined, 162-163 

Summary table, 157 

Testing differences between individual 
treatment means, 164-166 

Treatments effects: interpretation of 
test of, 159-160; test of, 157-158 


Trend analysis 


Comparison of trends in simple factorial 
and treatments X levels designs: test 
of coincidence of trend lines, 351; test
of parallelarity of trend lines, 350- 
351 

Designs appropriate for trend compari- 
sons, 351-354 

Simple-randomized design: test for 
goodness of fit to a priori trends, 344-
346; test for goodness of fit to curves 
fitted to observed means, 346-347; 
test for linear trend, 343-344; test for 
presence of trend, 341-342 

Treatments X levels design: tests for 
goodness of fit, 348; test for linear
trend, 348; tests for presence of trend, 
341-348 

Treatments X subjects: tests for good- 
ness of fit, 348; test for linear trend, 
348-349; tests for presence of trend, 
348 

Trend hypotheses, 340-341 

Type II design: tests for goodness of fit, 
349; test for linear trend, 349; tests 
for presence of trend, 349 

Trend factor, defined, 351 

Trend hypotheses: 340-341; for tests see 
Trend analysis 

Triple interaction: defined, 223; meaning 
of, 228-230 

True score, 359, 362-363, 365, 376, 378, 
380, 382 

Type I error: consequences of, 68-70; de- 
fined, 66-68 

Type II error: control of, 70-72; conse- 
quences of, 68-70; defined, 66-68; 
effect of precision on, 70 

Type G error, defined, 9-10 

Type R error, defined, 10-11 

Type S error, defined, 9 


